USER INDEPENDENT, REAL-TIME SPEECH
RECOGNITION SYSTEM AND METHOD 



BACKGROUND OF THE INVENTION
1. Technical Field

The present invention relates generally to speech recognition. More particularly, the
present invention is directed to a system and method for accurately recognizing continuous
human speech from any speaker.

2. Background Information

Linguists, scientists and engineers have endeavored for many years to construct
machines that can recognize human speech. Although in recent years this goal has begun to
be realized in certain respects, currently available systems have not been able to produce
results that even closely emulate human performance. This inability to provide satisfactory
speech recognition is due primarily to the difficulties that are involved in extracting and
identifying the individual sounds that make up human speech. These difficulties are
exacerbated by the wide acoustic variations that occur between different
speakers.

Simplistically, speech may be considered as a sequence of sounds taken from a set of
forty or so basic sounds called "phonemes." Different sounds, or phonemes, are produced by
varying the shape of the vocal tract through muscular control of the speech articulators (lips,
tongue, jaw, etc.). A stream of a particular set of phonemes will collectively represent a word
or a phrase. Thus, extraction of the particular phonemes contained within a speech signal is
necessary to achieve voice recognition.

However, a number of factors are present that make phoneme extraction extremely
difficult. For instance, wide acoustic variations occur when the same phoneme is spoken by
different speakers. This is due to the differences in the vocal apparatus, such as the
vocal-tract length. Moreover, the same speaker may produce acoustically different versions of the
same phoneme from one rendition to the next. Also, there are often no identifiable boundaries
between sounds or even words. Other difficulties result from the fact that phonemes are
spoken with wide variations in dialect, intonation, rhythm, stress, volume, and pitch. Finally,
the speech signal may contain wide variations in speech-related noises that make it difficult
to accurately identify and extract the phonemes.

The speech recognition devices that are currently available attempt to minimize the
above problems and variations by providing only a limited number of functions and
capabilities. For instance, many existing systems are classified as "speaker-dependent"
systems. A speaker-dependent system must be "trained" to a single speaker's voice by
obtaining and storing a database of patterns for each vocabulary word uttered by that
particular speaker. The primary disadvantage of these types of systems is that they are "single
speaker" systems, and can only be utilized by the speaker who has completed the
time-consuming training process. Further, the vocabulary size of such systems is limited to the
specific vocabulary contained in the database. Finally, these systems typically cannot
recognize naturally spoken continuous speech, and require the user to pronounce words
separated by distinct periods of silence.

Currently available "speaker-independent" systems are also severely limited in
function. Although any speaker can use the system without the need for training, these
systems can only recognize words from an extremely small vocabulary. Further, they too
require that the words be spoken in isolation with distinct pauses between words, and thus
cannot recognize naturally spoken continuous speech.

BRIEF SUMMARY OF THE INVENTION

Briefly summarized, in the preferred embodiment, an audio speech signal is received
from a speaker and input to an audio processor means. The audio processor means receives
the speech signal, converts it into a corresponding electrical format, and then electrically
conditions the signal so that it is in a form that is suitable for subsequent digital sampling.

Once the audio speech signal has been converted to a representative audio electrical
signal, it is sent to an analog-to-digital converter means. The A/D converter means samples
the audio electrical signal at a suitable sampling rate, and outputs a digitized audio signal.

The digitized audio signal is then programmably processed by a sound recognition
means, which processes the digitized audio signal in a manner so as to extract various time
domain and frequency domain sound characteristics, and then identify the particular phoneme
sound type that is contained within the audio speech signal. This characteristic extraction and
phoneme identification is done in a manner such that the speech recognition occurs regardless
of the source of the audio speech signal. Importantly, there is no need for a user to first
"train" the system with his or her individual voice characteristics. Further, the process occurs
in substantially real time so that the speaker is not required to pause between each word, and
can thus speak at normal conversational speeds.

In addition to extracting phoneme sound types from the incoming audio speech signal, 
the sound recognition means implements various linguistic processing techniques to translate 
the phoneme string into a corresponding word or phrase. This can be done for essentially any 
language that is made up of phoneme sound types. 

In the preferred embodiment, the sound recognition means is comprised of a digital 
sound processor means and a host sound processor means. The digital sound processor 
includes a programmable device and associated logic to programmably carry out the program 
steps used to digitally process the audio speech signal, and thereby extract the various time 
domain and frequency domain sound characteristics of that signal. This sound characteristic 
data is then stored in a data structure, which corresponds to the specific portion of the audio 
signal. 

The host sound processor means also includes a programmable device and its 
associated logic. It is programmed to carry out the steps necessary to evaluate the various 
sound characteristics contained within the data structure, and then generate the phoneme 
sound type that corresponds to those particular characteristics. In addition to identifying 
phonemes, in the preferred embodiment the host sound processor also performs the program 
steps needed to implement the linguistic processing portion of the overall method. In this
way, the incoming stream of phonemes is translated to the representative word or phrase.

The preferred embodiment further includes an electronic means, connected to the 
sound recognition means, for receiving the word or phrase translated from the incoming 
stream of identified phonemes. The electronic means, as for instance a personal computer, 
then programmably processes the word as either data input, as for instance text to a 
wordprocessing application, or as a command input, as for instance an operating system 
command. 



BRIEF DESCRIPTION OF THE DRAWINGS
In order that the manner in which the above-recited and other advantages and objects
of the invention are obtained, a more particular description of the invention briefly described
above will be rendered by reference to a specific embodiment thereof which is illustrated in
the appended drawings. Understanding that these drawings depict only a typical embodiment
of the invention and are not to be considered to be limiting of its scope, the invention in its
presently understood best mode will be described and explained with additional specificity and
detail through the use of the accompanying drawings in which:

Figure 1 is a functional block diagram of the overall speech recognition system;

Figure 2 is a more detailed functional block diagram illustrating the speech recognition
system;

Figures 3A through 3Y are a schematic illustrating in detail the circuitry that
makes up the functional blocks in Figure 2;

Figure 4 is a functional flow-chart illustrating the overall program method of the
present invention;

Figures 5A-5B are a flow-chart illustrating the program method used to implement one
of the functional blocks of Figure 4;

Figures 6-6D are a flow-chart illustrating the program method used to implement one
of the functional blocks of Figure 4;

Figure 7 is a flow-chart illustrating the program method used to implement one of the
functional blocks of Figure 4;

Figures 8-8D are a flow-chart illustrating the program method used to implement one
of the functional blocks of Figure 4;

Figure 9 is a flow-chart illustrating the program method used to implement one of the
functional blocks of Figure 4;

Figures 10-10C are a flow-chart illustrating the program method used to implement one
of the functional blocks of Figure 4;

Figure 11 is a flow-chart illustrating the program method used to implement one of
the functional blocks of Figure 4;

Figures 12A-12C are x-y plots of example standard sound data.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

The following detailed description is divided into two parts. In the first part the
overall system is described, including a detailed description of the functional blocks which
make up the system, and the manner in which the various functional blocks are interconnected.
In part two, the method by which the overall system is programmably controlled to achieve
real-time, user-independent speech recognition is described.

I. THE SYSTEM 
Reference is first made to Figure 1, where one presently preferred embodiment of the
overall speech recognition system is designated generally at 10. The system 10 includes an
audio processor means for receiving an audio speech signal and for converting that signal into
a representative audio electrical signal. In the preferred embodiment, the audio processor
means is comprised of a means for inputting an audio signal and converting it to an electrical
signal, such as a standard condenser microphone shown generally at 12. Various other input
devices could also be utilized to input an audio signal, including, but not limited to, such
devices as a dictaphone, telephone or a wireless microphone.

In addition to microphone 12, the audio processor means also preferably comprises
audio processor circuitry 14. This circuitry 14 receives the audio
electrical signal generated by the microphone 12, and then functions so as to condition the
signal so that it is in a suitable electrical condition for digital sampling.

The audio processor circuitry 14 is then electrically connected to analog-to-digital
converter means, illustrated in the preferred embodiment as A/D conversion circuitry 34. This
circuitry receives the audio electrical signal, which is in an analog format, and converts it
to a digital format, outputting a digitized audio signal.

The digitized audio signal is then passed to a sound recognition means, which in the
preferred embodiment corresponds to the block designated at 16 and referred to as the sound
recognition processor circuitry. Generally, the sound recognition processor circuitry 16
programmably analyzes the digitized version of the audio signal in a manner so as to
extract various acoustical characteristics from the signal. Once the necessary characteristics
are obtained, the circuitry 16 can identify the specific phoneme sound types contained within
the audio speech signal. Importantly, this phoneme identification is done without reference
to the speech characteristics of the individual speaker, and is done in a manner such that the
phoneme identification occurs in real time, thereby allowing the speaker to speak at a normal
rate of conversation.

The sound recognition processor circuitry 16 obtains the necessary acoustical
characteristics in two ways. First, it evaluates the time domain representation of the audio
signal, and from that representation extracts various parameters representative of the type of
phoneme sound contained within the signal. The sound type would include, for example,
whether the sound is "voiced," "unvoiced," or "quiet."

Secondly, the sound recognition processor circuitry 16 evaluates the frequency domain
representation of the audio signal. Importantly, this is done by successively filtering the time
domain representation of the audio signal using a predetermined number of filters having
various cutoff frequencies. This produces a number of separate filtered signals, each of which
is representative of an individual signal waveform which is a component of the complex
audio signal waveform. The sound recognition processor circuitry 16 then "measures" each
of the filtered signals, and thereby extracts various frequency domain data, including the
frequency and amplitude of each of the signals. These frequency domain characteristics,
together with the time domain characteristics, provide sufficient "information" about the audio
signal such that the processor circuitry 16 can identify the phoneme sounds that are contained
therein.

Once the sound recognition processor circuitry 16 has extracted the corresponding 
phoneme sounds, it programmably invokes a series of linguistic program tools. In this way, 
the processor circuitry 16 translates the series of identified phonemes into the corresponding 
syllable, word or phrase. 

With continued reference to Figure 1, electrically connected to the sound recognition 
processor circuitry 16 is a host computer 22. In one preferred embodiment, the host
computer 22 is a standard desktop personal computer; however, it could be comprised of
virtually any device utilizing a programmable computer that requires data input and/or control.
For instance, the host computer 22 could be a data entry system for automated baggage 
handling, parcel sorting, quality control, computer aided design and manufacture, and various 
command and control systems. 

As the processor circuitry 16 translates the phoneme string, the corresponding word 
or phrase is passed to the host computer 22. The host computer 22, under appropriate 
program control, then utilizes the word or phrase as an operating system or application 
command or, alternatively, as data that is input directly into an application, such as a 
wordprocessor or database. 

Reference is next made to Figure 2 where one presently preferred embodiment of the
voice recognition system 10 is shown in further detail. As is shown, an audio speech signal
is received at microphone 12, or similar device. The representative audio electrical signal is
then passed to the audio processor circuitry 16 portion of the system. In the preferred
embodiment of this circuit, the audio electrical signal is input to a signal amplification means
for amplifying the signal to a suitable level, such as amplifier circuit 26. Although a number
of different circuits could be used to implement this function, in the preferred embodiment,
amplifier circuit 26 consists of a two stage operational amplifier configuration, arranged so
as to provide an overall gain of approximately 300. With such a configuration, with a
microphone 12 input of approximately 60 dBm, the amplifier circuit 26 will produce an output
signal at approximately line level.

In the preferred embodiment, the amplified audio electrical signal is then passed to a
means for limiting the output level of the audio signal so as to prevent an overload condition
to other components contained within the system 10. The limiting means is comprised of a
limiting amplifier circuit 28, which can be designed using a variety of techniques, one example
of which is shown in the detailed schematic of Figure 3.

Next, the amplified audio electrical signal is passed to a filter means for filtering high
frequencies from the electrical audio signal, as for example an anti-aliasing filter circuit. This
circuit, which again can be designed using any one of a number of circuit designs, merely
limits the highest frequency that can be passed on to other circuitry within the system 10.

The audio electrical signal, which is in an analog format, is then passed to an
analog-to-digital converter means for digitizing the signal, which is shown as A/D conversion
circuit 34. In the preferred embodiment, A/D conversion circuit 34 utilizes a 16-bit analog to
digital converter device, which is based on Sigma-Delta sampling technology. Further, the device
must sample the incoming analog signal at a rate sufficient to avoid aliasing
errors. At a minimum, the sampling rate should be at least twice the incoming sound wave's
highest frequency (the Nyquist rate), and in the preferred embodiment the sampling rate
is 44.1 kHz. It will be appreciated that any one of a number of A/D conversion devices that are
commercially available could be used. A presently preferred component, along with the
various support circuitry, is shown in the detailed schematic of Figure 3.

With continued reference to Figure 2, having converted the audio electrical signal to
a digital form, the signal is then supplied to the sound recognition processor circuitry
16. In the presently preferred embodiment, the sound recognition processor circuitry 16 is
comprised of a digital sound processor means and a host sound processor means, both of
which are preferably comprised of programmable devices. It will be appreciated however that
under certain conditions, the sound recognition processor circuitry 16 could be comprised of
suitable equivalent circuitry which utilizes a single programmable device.

In the presently preferred embodiment, the digital sound processor means is comprised 
of the various circuit components within the dotted box 18 and referred to as the digital sound 
processor circuitry. This circuitry receives the digitized audio signal, and then programmably 
manipulates that data in a manner so as to extract various sound characteristics. Specifically, 
the circuitry 18 first analyzes the digitized audio signal in the time domain and, based on that 
analysis, extracts at least one time domain sound characteristic of the audio signal. The time 
domain characteristics of interest help determine whether the audio signal contains a phoneme 
sound that is "voiced," "unvoiced," or "quiet." 

The digital sound processor circuitry 18 also manipulates the digitized audio signal so 
as to obtain various frequency domain information about the audio signal. This is done by 
filtering the audio signal through a number of filter bands and generating a corresponding
number of filtered signals, each of which is still in the time domain. The circuitry 18 measures
various properties exhibited by these individual waveforms, and from those measurements, 
extracts at least one frequency domain sound characteristic of the audio signal. The frequency 
domain characteristics of interest include the frequency, amplitude and slope of each of the 
component signals obtained as a result of the filtering process. These characteristics are then 
stored and used to determine the phoneme sound type that is contained in the audio signal. 
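
The filter-and-measure idea described above can be illustrated with a minimal sketch. It is not the circuitry's actual program: the text does not specify the filter design, so a standard second-order (biquad) band-pass section is assumed here, and the band centres, the `q` value and all function names are hypothetical. Only frequency and amplitude are measured; slope is omitted for brevity.

```python
import math


def bandpass(signal, center_hz, rate_hz, q=2.0):
    """Second-order band-pass with unity gain at center_hz (assumed design).

    The output is still a time domain signal, as in the description.
    """
    w0 = 2.0 * math.pi * center_hz / rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0                  # b1 is zero for this filter
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def measure(filtered, rate_hz):
    """Estimate (frequency in Hz, peak amplitude) of one filtered signal."""
    settled = filtered[len(filtered) // 2:]           # skip the start-up transient
    amplitude = max(abs(y) for y in settled)
    crossings = sum(1 for a, b in zip(settled, settled[1:]) if (a < 0) != (b < 0))
    frequency = crossings * rate_hz / (2.0 * len(settled))
    return frequency, amplitude


def decompose(signal, rate_hz, centers_hz):
    """Filter the signal through each band and measure every filtered signal."""
    return {c: measure(bandpass(signal, c, rate_hz), rate_hz) for c in centers_hz}


rate = 44100
tone = [math.sin(2.0 * math.pi * 1000.0 * n / rate) for n in range(1024)]
result = decompose(tone, rate, [250, 1000, 4000])     # hypothetical band centres
strongest = max(result, key=lambda c: result[c][1])
print(strongest)                                      # 1000
```

For a pure 1 kHz test tone, the 1 kHz band carries the largest amplitude, and its zero-crossing count gives a frequency estimate close to 1 kHz.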

With continued reference to Figure 2, the digital sound processor circuitry 18 is shown 
as preferably comprising a first programmable means for analyzing the digitized audio signal 
under program control, such as digital sound processor 36. Digital sound processor 36 is 
preferably a programmable, 24-bit general purpose digital signal processor device, such as the
Motorola DSP56001. However, any one of a number of commercially available digital signal
processors could also be used.

As is shown, digital sound processor 36 is preferably interfaced - via a standard
address, data and control bus-type arrangement 38 - to various other components. They
include: a program memory means for storing the set of program steps executed by the DSP
36, such as DSP program memory 40; data memory means for storing data utilized by the
DSP 36, such as DSP data memory 42; and suitable control logic 44 for implementing the
various standard timing and control functions such as address and data gating and mapping.
It will be appreciated by one of skill in the art that various other components and functions
could be used in conjunction with the digital sound processor 36.

With continued reference to Figure 2, in the presently preferred embodiment, the host
sound processor means is comprised of the various circuit components within the dotted box
20 and referred to as the host sound processor circuitry. This host sound processor circuitry
20 is electrically connected and interfaced, via an appropriate host interface 52, to the digital
sound processor circuitry 18. Generally, this circuitry 20 receives the various audio signal
characteristic information generated by the digital sound processor circuitry 18 via the host
interface 52. The host sound processor circuitry 20 analyzes this information and then
identifies the phoneme sound type(s) that are contained within the audio signal by comparing
the signal characteristics to standard sound data that has been compiled by testing a
representative cross-section of speakers. Having identified the phoneme sounds, the host
sound processor circuitry 20 utilizes various linguistic processing techniques to translate the
phonemes into a representative syllable, word or phrase.

The host sound processor circuitry 20 is shown as preferably comprising a second
programmable means for analyzing the digitized audio signal characteristics under program
control, such as host sound processor 54. Host sound processor 54 is preferably a
programmable, 32-bit general purpose CPU device, such as the Motorola 68EC030.
However, any one of a number of commercially available programmable processors could also
be used.

As is shown, host sound processor 54 is preferably interfaced - via a standard address,
data and control bus-type arrangement 56 - to various other components. They include: a
program memory means for storing the set of program steps executed by the host sound
processor 54; data memory means for storing data utilized
by the host sound processor 54, such as host data memory 60; and suitable control logic
for implementing the various standard timing and control functions such as address and data
gating and mapping. Again, it will be appreciated by one of skill in the art that various other
components and functions could be used in conjunction with the host sound processor 54.

Also included in the preferred embodiment is a means for interfacing the host sound
processor circuitry 20 to an external electronic device. In the preferred embodiment, the
interface means is comprised of standard RS-232 interface circuitry 66 and associated RS-232
cable 24. However, other electronic interface arrangements could also be used, such as a
standard parallel port interface, a musical instrument digital interface (MIDI), or a
non-standard electrical interface arrangement.

In the preferred embodiment, the host sound processor circuitry 20 is interfaced to an
electronic means for receiving the word generated by the host sound processor circuitry 20
and for processing that word as either a data input or as a command input. By way of
example and not limitation, the electronic receiving means is comprised of a host computer
22, such as a standard desktop personal computer. The host computer 22 is connected to the
host sound processor circuitry 20 via the RS-232 interface 66 and cable 24 and, via an
appropriate program method, utilizes incoming words as either data, such as text to a
wordprocessor application, or as a command, such as to an operating system or application
program. It will be appreciated that the host computer 22 can be virtually any electronic
device requiring data or command input.

One example of an electronic circuit which has been constructed and used to
implement the above described block diagram is illustrated in Figures 3A-3Y. These figures
are a detailed electrical schematic diagram showing the interconnections, part number and/or
value of each circuit element used. It should be noted that Figures 3A-3Y are included merely
to show an example of one such circuit which has been used to implement the functional
blocks described in Figure 2. Other implementations could be designed that would also work
satisfactorily.

II. THE METHOD

Referring now to Figure 4, illustrated is a functional flow chart showing one presently
preferred embodiment of the overall program method used by the present system. As is
shown, the method allows the voice recognition system 10 to continuously receive an
incoming speech signal, electronically process and manipulate that signal so as to generate the
phonetic content of the signal, and then produce a word or stream of words that correspond
to that phonetic content. Importantly, the method is not restricted to any one speaker, or
group of speakers. Rather, it allows for the unrestricted recognition of continuous speech
utterances from any speaker of a given language.

Following is a general description of the overall method. A more detailed description
of the preferred program steps used to carry out the method will follow.

The audio processor 16 portion of the system receives the audio speech signal, which
is then digitally sampled. In the preferred embodiment the sampling rate is 44.1 kHz, although
other sampling rates could be used so long as they comply with the Nyquist sampling rate so
as to avoid aliasing. This digitized speech signal is then broken up into "time segments." In
the preferred embodiment, each of these time segments contains 10,240 data points, or 232
milliseconds of time domain data.

Each time segment of 10,240 data points is then passed to the "Evaluate Time Domain"
function, shown at 102, where it is further broken up into "time slices," each containing 256
data points, or 5.8 milliseconds of time domain data.

The next step in the overall algorithm is shown at 104 and is labeled "Decompose."
In this portion of the program method, each time slice is broken down into individual
component waveforms by successively filtering the time slice.
From each of these filtered signals, the Decompose function directly extracts additional sound
identifying characteristics by "measuring" each signal. This information is then stored in each
time slice's corresponding data structure.

The next step in the overall algorithm is at 106 and is labeled "Point of Maximum
Intelligence." In this portion of the program method, those time slices that contain sound
data which is most pertinent to the identification of the sound(s) are identified, and
the other time slices are ignored. In addition to increasing the accuracy of
subsequent phoneme identification, this function also reduces the amount of processing
overhead required to identify the sound(s) contained within the time segment.

Having identified those time slices that are needed to identify the particular sound(s)
contained within the time segment, the system then executes the program steps corresponding
to the functional block 108 labeled "Evaluate." In this portion of the algorithm, all of the
information contained within each time slice's corresponding data structure is analyzed, and
up to five of the most probable phonetic sounds (i.e., phonemes) contained within the time
slice are identified. Each possible sound is also assigned a probability level, and the sounds
are ranked in that order. The identified sounds and their probabilities are then stored within
the particular time slice's data structure. Each individual phoneme sound type is identified by
way of a unique identifying number referred to as a "PASCH" value.
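
As a minimal sketch of the ranking step just described, the following keeps up to five candidates ordered by probability. How the probabilities are computed is not covered here; the function name, the numeric identifiers and the probability values are all hypothetical placeholders.

```python
def rank_candidates(scores, limit=5):
    """Return up to `limit` (identifier, probability) pairs, most probable first.

    `scores` maps a phoneme identifying number to its assigned probability.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical identifiers and probabilities for one time slice.
scores = {101: 0.10, 102: 0.45, 103: 0.08, 104: 0.20, 105: 0.15, 106: 0.02}
top = rank_candidates(scores)
print(top)
# [(102, 0.45), (104, 0.2), (105, 0.15), (101, 0.1), (103, 0.08)]
```

The ranked list is what would be stored back into the time slice's data structure.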

The next functional step in the overall program method is performed by the system at 
the functional block 110 labeled "Compress Phones." In this function, the time slices that do
not correspond to "points of maximum intelligence" are discarded. Only those time slices 
which contain the data necessary to identify the particular sound are retained. Also, time 
slices which contain contiguous "quiet" sections are combined, thereby further reducing the 
overall number of time slices. Again, this step reduces the amount of processing that must 
occur and further facilitates real time sound recognition. 
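
The compression step can be sketched as follows, under the assumption that quiet slices are always retained (and merged when adjacent) while any other slice is kept only if it was marked as a point of maximum intelligence. The dictionary keys `type`, `pmi` and `points` are hypothetical names chosen for illustration.

```python
def compress_slices(slices):
    """Drop non-PMI slices and merge runs of quiet slices into one entry."""
    kept = []
    for s in slices:
        if s["type"] == "quiet":
            if kept and kept[-1]["type"] == "quiet":
                kept[-1]["points"] += s["points"]   # extend the current quiet run
            else:
                kept.append(dict(s))                # start a new quiet run
        elif s["pmi"]:
            kept.append(dict(s))                    # slice needed to identify a sound
        # any other slice is discarded
    return kept


slices = [
    {"type": "quiet", "pmi": False, "points": 256},
    {"type": "quiet", "pmi": False, "points": 256},
    {"type": "voiced", "pmi": False, "points": 256},
    {"type": "voiced", "pmi": True, "points": 256},
    {"type": "quiet", "pmi": False, "points": 256},
]
compressed = compress_slices(slices)
print([(s["type"], s["points"]) for s in compressed])
# [('quiet', 512), ('voiced', 256), ('quiet', 256)]
```

Five input slices shrink to three, which is the reduction in downstream processing the text refers to.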

At this point in the algorithm, there remains a sequence of time slices, each of which
has a corresponding data structure containing various sound characteristics culled from both
the time domain and the frequency domain. Each structure also identifies the most probable
phoneme sound type corresponding to those particular sound characteristics. This data is
passed to the next step of the overall program method, shown at functional block 112 and
labeled "Linguistic Processor." The Linguistic Processor receives the data structures, and
translates the sound stream (i.e., stream of phonemes) into the corresponding English letter,
syllable, word or phrase. This translation is generally accomplished by performing a variety
of linguistic processing functions that match the phonemic sequences against entries in the
system lexicon. The presently preferred linguistic functions include a phonetic dictionary
look-up, a context checking function and database, and a basic grammar checking function.
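
A phonetic dictionary look-up of the kind mentioned above can be sketched with a toy lexicon and a greedy longest-match scan. This is only an illustration: the lexicon entries, the letter-style phoneme symbols (the system itself uses numeric identifiers) and the matching strategy are assumptions, and the context and grammar checking functions are not modelled.

```python
# Toy lexicon mapping phoneme sequences to words (hypothetical entries).
LEXICON = {
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
    ("HH", "EH"): "heh",
}


def phonemes_to_words(phonemes, lexicon=LEXICON):
    """Translate a phoneme stream into words, preferring the longest match."""
    words = []
    max_len = max(len(key) for key in lexicon)
    i = 0
    while i < len(phonemes):
        for length in range(min(max_len, len(phonemes) - i), 0, -1):
            word = lexicon.get(tuple(phonemes[i:i + length]))
            if word is not None:
                words.append(word)
                i += length
                break
        else:
            i += 1   # no lexicon entry starts here; skip this phoneme
    return words


stream = ["HH", "EH", "L", "OW", "W", "ER", "L", "D"]
print(phonemes_to_words(stream))   # ['hello', 'world']
```

Preferring the longest match keeps "hello" from being split into "heh" plus unmatched phonemes; a real lexicon would need the context and grammar checks to resolve harder ambiguities.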

Once the particular word or phrase is identified, it is passed to the "Command
Processor" portion of the algorithm, as shown at functional block 114. The Command
Processor determines whether the word or phrase constitutes text that should be passed as
data to a higher level application, such as a wordprocessor, or whether it constitutes a
command that is to be passed directly to the operating system or application.

A data structure is maintained for each time slice of data (256 samples of digitized
sound data; 5.8 milliseconds of sound). This data structure is used to store the various sound
characteristics and data that can be used to identify the particular phoneme contained within
the corresponding time slice. Although other information could also be stored in the
structure, TABLE I illustrates one preferred embodiment of its contents. The data
structure and its contents will be discussed in further detail below.



VARIABLE NAME       CONTENTS

TYPE                Voiced, Unvoiced, Quiet or Not Processed.
LOCATION            Array location of where Time Slice starts.
[illegible]         Number of sample data points in Time Slice.
[illegible]         Average amplitude of signal in time domain.
FFREQ               Fundamental Frequency of signal.
AMPL                Array containing the amplitude of each filtered signal.
[illegible]         Zero Crossing Rate of signal in time domain.
PMI                 [illegible] maximum formant [illegible] value.
sumSlope            Sum of absolute values of filtered signal slopes.
POSSIBLE PHONEMES   [illegible] standard [illegible].

TABLE I
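
For orientation, the per-slice data structure of TABLE I might be modelled as below. This is a sketch only: the Python types are assumptions, and the field names `size`, `avg_amplitude` and `zero_crossing_rate` are hypothetical stand-ins for variable names that are not legible in the table.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TimeSlice:
    sound_type: str = "not processed"    # TYPE: voiced, unvoiced, quiet or not processed
    location: int = 0                     # LOCATION: array index where the slice starts
    size: int = 256                       # hypothetical name: number of sample data points
    avg_amplitude: float = 0.0            # hypothetical name: average time domain amplitude
    ffreq: float = 0.0                    # FFREQ: fundamental frequency of the signal
    ampl: List[float] = field(default_factory=list)   # AMPL: amplitude of each filtered signal
    zero_crossing_rate: float = 0.0       # hypothetical name: time domain zero crossing rate
    pmi: float = 0.0                      # PMI: value used for point of maximum intelligence
    sum_slope: float = 0.0                # sumSlope: sum of absolute filtered signal slopes
    # POSSIBLE PHONEMES: (identifying number, probability) candidates
    possible_phonemes: List[Tuple[int, float]] = field(default_factory=list)


first = TimeSlice(location=0)
second = TimeSlice(location=256)
first.ampl.append(0.5)
print(second.location, second.size, second.ampl)   # 256 256 []
```

Using `default_factory` gives every slice its own lists, so measurements stored for one slice never leak into another.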



The various steps used to accomplish the method illustrated in Figure 4 will now
be discussed in more detail. It should be appreciated that the particular program steps
which are illustrated in the detailed flow charts contained in Figures 5 through 11 are
intended merely as an example of the presently preferred embodiment and the presently
understood best mode of implementing the overall functions which are represented by the
flow chart of Figure 4.

Referring first to Figure 5A, the particular program steps corresponding to the
"Evaluate Time Domain" function illustrated in functional block 102 of Figure 4 are
shown. As already noted, the Audio Processor 16 receives an audio speech signal from
the microphone 12. The A/D conversion circuitry 34 then digitally samples that signal at
a predetermined sampling rate, such as the 44.1 kHz rate used in the preferred
embodiment. This time domain data is divided into separate, consecutive time segments
of predetermined lengths. In the preferred embodiment, each time segment is 232
milliseconds in duration, and consists of 10,240 digitized data points. Each time segment
is then passed, one at a time, to the Evaluate Time Domain function, as is shown at step
116 in Figure 5A. Once received, the time segment is further segmented into a
predetermined number of equal "slices" of time. In the preferred embodiment, there are
forty of these "time slices" for each time segment, each of which is comprised of 256
data points, or 5.8 milliseconds of speech.
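
The segment and slice figures quoted above follow directly from the sampling rate, as this short sketch shows. The constants come from the text; the function names are illustrative only.

```python
SAMPLE_RATE_HZ = 44_100      # sampling rate of the preferred embodiment
SEGMENT_POINTS = 10_240      # data points per time segment
SLICE_POINTS = 256           # data points per time slice


def split_into_slices(segment):
    """Split one time segment into consecutive, equal-length time slices."""
    if len(segment) != SEGMENT_POINTS:
        raise ValueError("expected a full time segment")
    return [segment[i:i + SLICE_POINTS]
            for i in range(0, SEGMENT_POINTS, SLICE_POINTS)]


def duration_ms(points, rate_hz=SAMPLE_RATE_HZ):
    """Duration in milliseconds of `points` samples taken at `rate_hz`."""
    return 1000.0 * points / rate_hz


segment = [0] * SEGMENT_POINTS                # placeholder samples
slices = split_into_slices(segment)
print(len(slices))                            # 40
print(round(duration_ms(SEGMENT_POINTS)))     # 232
print(round(duration_ms(SLICE_POINTS), 1))    # 5.8
```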

The digital sound processor 36 then enters a program loop, beginning with step
118. As is indicated at that step, for each time slice the processor 36 extracts various
time-varying acoustic characteristics. For example, in the preferred embodiment the DSP
36 calculates the absolute average of the amplitude of the time slice signal (L_S), the
absolute difference average (L_D) of the time slice signal and the zero crossing rate (Z_CR)
of the time slice signal. The absolute average of the amplitude L_S corresponds to the
absolute value of the average of the amplitudes (represented as a line level signal voltage)
of the data points contained within the time slice. The absolute difference average L_D is
the average amplitude difference between the data points in the time slice (i.e., calculated
by taking the average of the differences between the absolute value of one data point's
amplitude and the next data point's). The zero crossing rate Z_CR is calculated by dividing
the number of zero crossings that occur within the time slice by the number of data points
(256) and multiplying the result by 100. The number of zero crossings is equal to the
number of times the time domain data crosses the X-axis, whether that crossing be
positive-to-negative or negative-to-positive.
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The n*,?*^ of thex ^ acoustica , propenjes ^ ^ _ o 
f e„era .ype of sound ^ ^ ^ ^ ^ * 

votced speech sound, is generally found „ ,„ wer tha „ fcr „ un J^„ 

sound, and the amplitude of unvoiced sounds is generally much lower rhan ,he amplitude 
of voiced sound, These genets are tnJ e of al, speaks, and genera , ral J „ av 
been .den-ned by an^ng ^ dala , aka „ ^ , ^ rf * £ 

won,*, and chtldren). By cornpa^ lhe various acoustical propel ZZ 

z:r ranses ' ,he — - - be — ■ ««— -he pa.ico,:; 
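
The three per-slice measurements described above can be sketched in a few lines. This is only an illustration, not the DSP code of the preferred embodiment: the function name `time_domain_features` is hypothetical, and two readings are assumed where the wording is ambiguous (L_S is taken as the mean of the absolute amplitudes, and L_D as the mean absolute difference between consecutive samples).

```python
def time_domain_features(samples):
    """Return (L_S, L_D, Z_CR) for one time slice of amplitude values."""
    n = len(samples)
    pairs = list(zip(samples, samples[1:]))
    # L_S: assumed here to be the mean of the absolute amplitudes.
    l_s = sum(abs(x) for x in samples) / n
    # L_D: assumed here to be the mean absolute sample-to-sample difference.
    l_d = sum(abs(b - a) for a, b in pairs) / len(pairs)
    # Z_CR: sign changes divided by the number of data points, times 100.
    crossings = sum(1 for a, b in pairs if (a < 0) != (b < 0))
    z_cr = crossings / n * 100
    return l_s, l_d, z_cr
```

For a 256-point slice the divisor in the Z_CR line is 256, matching the description of step 118.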

Thus, based on the acoustical properties identified in the previous step, the DSP
36 next proceeds to that portion of the program loop that identifies what type of sound
is contained within the particular time slice. In the preferred embodiment, this portion of
the code determines, based on previously identified ranges obtained from test data,
whether the sound contained within the time slice is "quiet," "voiced" or "unvoiced."

At step 120, the absolute average of the amplitude L_S is first compared with a
predetermined "quiet level," or "QLEVEL" (i.e., an amplitude magnitude level that
corresponds to silence). In the preferred embodiment, QLEVEL is a predetermined
value that can generally be anywhere between 200 and 500. It will be appreciated that the
"quiet level" may vary depending on the application or environment (e.g., a high
level of background noise, a high d.c. offset present in the A/D conversion, or where the
incoming signal is amplified to a different level), and thus may be a different value. If L_S
is less than QLEVEL, the sound contained within the time slice is deemed to be "quiet,"
and the processor 36 proceeds to step 122. At step 122, the DSP 36 begins to build the
Higgins data structure for the current time slice within DSP data memory 42. Here, the
time slice is identified as quiet in the "type" flag of that structure.

If, however, L_S is greater than QLEVEL, then the sound contained within the time
slice is not quiet, and the processor 36 proceeds to step 124 to determine whether the
sound is instead a "voiced" sound. To make this determination, the zero crossing rate
Z_CR is first compared with a predetermined crossing-rate value found to be indicative of a
voiced sound for most speakers. A low zero-crossing rate implies a low frequency and,
in the preferred embodiment, if it is less than or equal to about 10, the speech sound is
probably voiced.

If the Z_CR does fall at or below 10, another acoustical property of the sound is
evaluated before the determination is made that the sound is voiced. This property is
checked by calculating the ratio of L_D to L_S, and then comparing that ratio to another
predetermined value that corresponds to a cut-off point for voiced sounds in most
speakers. In the preferred embodiment, if L_D/L_S is less than or equal to about 15, then the
signal is probably voiced. Thus, if at step 124 it is determined that Z_CR is less than or
equal to 10 and that L_D/L_S is less than or equal to about 15, then the sound is deemed to
be a voiced type of sound (e.g., the sounds /U/, /d/, /w/, /i/, /e/, etc.). If voiced, the
processor 36 proceeds to step 126 and places an identifier "V" into the "type" flag of the
Higgins data structure corresponding to that time slice.

If not voiced, then the processor 36 proceeds to program step 128 to determine
if the sound is instead "unvoiced," again by comparing the properties identified at step 118
to ranges obtained from user-independent test data. To do so, processor 36 determines
whether Z_CR is greater than or equal to about 20 and whether L_D/L_S is greater than or
equal to about 30. If both conditions exist, the sound is considered to be an unvoiced type
of sound (e.g., certain aspirated sounds). If unvoiced, the processor 36 proceeds to step
130 and places an identifier "U" into the "type" flag of the Higgins data structure for that
particular time slice.

Some sounds will fall somewhere between the conditions checked for in steps 124
and 128 (i.e., Z_CR falls somewhere between about 11 and 19, and L_D/L_S falls somewhere
between about 16 and 29), and other sound properties must be evaluated to determine
whether the sound is voiced or unvoiced. This portion of the program method is
performed, as is indicated at step 132, by executing another set of program steps referred
to as "Is it Voiced." The program steps corresponding to this function are illustrated in
Figure 5B, to which reference is now made.
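
The decision sequence of steps 120 through 132 can be summarized as a small function. This is a sketch under stated assumptions rather than the actual program code: the name `classify_slice`, the identifier "Q" for a quiet slice, and the default QLEVEL of 300 (chosen from within the stated 200 to 500 range) are all hypothetical. The ratio argument is L_D/L_S expressed on whatever scale the thresholds 15 and 30 assume, which the text does not state.

```python
def classify_slice(l_s, ratio, z_cr, qlevel=300.0):
    """Return "Q", "V", "U", or None when the "Is it Voiced" test is needed."""
    if l_s < qlevel:                  # step 120: amplitude below the quiet level
        return "Q"                    # hypothetical identifier for "quiet"
    if z_cr <= 10 and ratio <= 15:    # step 124: voiced ranges
        return "V"
    if z_cr >= 20 and ratio >= 30:    # step 128: unvoiced ranges
        return "U"
    return None                       # step 132: undecided, evaluate further
```

Returning `None` for the in-between case keeps the ambiguous slices separate, so that the additional test of Figure 5B is applied only where the simple ranges do not decide.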

After receiving the current time slice data at step 141, the processor proceeds to
step 142, where a digital low pass filter is programmably implemented within the DSP 36.
The speech signal contained within the current time slice is then passed through this filter.
In the preferred embodiment, the filter removes frequencies above 3000 Hz, and the zero
crossing rate, as discussed above, is recalculated. This is because certain voiced fricatives
have a high frequency noise component that tends to raise the zero crossing rate of the
signal. For these types of sound, elimination of the high frequency components will drop
the Z_CR to a level which corresponds to other voiced sounds. In contrast, if the sound is
an unvoiced fricative, then the Z_CR will remain largely unchanged and stay at a relatively
high level, because the majority of the signal resides at higher frequencies.

Once the new Z_CR has been calculated, program step 144 is performed to further
evaluate whether the sound is a voiced or an unvoiced fricative. Here, the time slice's
absolute minimum amplitude point is located. Once located, the processor 36 computes
the slope (i.e., the first derivative) of the line defined between that point and another
point on the waveform that is located a predetermined distance from the minimum point.
In the preferred embodiment, that predetermined distance is 50 data points, but other
distance values could also be used. For a voiced fricative sound, the slope will be
relatively high, since the signal is periodic and thus exhibits a fairly significant change in
amplitude. In contrast, for an unvoiced fricative sound the slope will be relatively low,
because the signal is not periodic and, having been filtered, will be comprised primarily of
random noise having a fairly constant amplitude.

Having calculated the Z_CR and the slope, the processor 36 proceeds to step 146
and compares their magnitudes to predetermined values corresponding to the threshold of
a voiced fricative for most speakers. In the preferred embodiment, if Z_CR is less than
about 8, and if the slope is greater than about 35, then the sound contained within the time
slice is deemed to be voiced, and the corresponding "true" flag is set at step 150.
Otherwise, the sound is considered unvoiced, and the "false" flag is set at step 148. Once
the appropriate flag is set, the "Is it Voiced" program sequence returns to its calling
routine at step 132, shown in Figure 5A.
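
The slope test of steps 144 through 150 might look as follows once the slice has already been low-pass filtered. Several details are assumptions made only for illustration: the name `is_it_voiced`, treating the "absolute minimum amplitude point" as the lowest sample value, measuring the slope in amplitude units per sample, and looking forward from the minimum (or backward when the slice ends first).

```python
def is_it_voiced(filtered, offset=50, z_cr_max=8.0, slope_min=35.0):
    """Return True (voiced) or False (unvoiced) for a low-pass filtered slice."""
    n = len(filtered)
    # Recalculate the zero crossing rate on the filtered data.
    crossings = sum(1 for a, b in zip(filtered, filtered[1:])
                    if (a < 0) != (b < 0))
    z_cr = crossings / n * 100
    # Step 144: locate the minimum point and a second point `offset` away.
    i_min = min(range(n), key=lambda i: filtered[i])
    j = i_min + offset if i_min + offset < n else i_min - offset
    slope = abs(filtered[j] - filtered[i_min]) / offset
    # Step 146: compare both values with the preferred-embodiment thresholds.
    return z_cr < z_cr_max and slope > slope_min
```

The slice must be longer than `offset` samples; a 256-point slice satisfies this for the preferred distance of 50 data points.
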

Referring again to Figure 5A at step 134, based on the results of the previous step
132, the appropriate identifier "U" or "V" is placed into the "type" flag of the Higgins data
structure for that particular time slice. Once it has been determined whether the speech
sound contained within the particular time slice is voiced, unvoiced or quiet, and the
Higgins data structure has been updated accordingly at steps 122, 126, 130 or 134, the
DSP 36 proceeds to step 136 and determines whether the last of the forty time slices for
the particular time segment has been processed. If so, the DSP 36 returns to the main
calling routine (illustrated in Figure 4) as is indicated at step 140. Alternatively, the DSP
36 obtains the next time slice at step 138, and proceeds as described above.

Referring again to Figure 4, once the "Evaluate Time Domain Parameters"
function shown at functional block 102 has been completed, the "Decompose a Speech
Signal" portion of the algorithm shown at functional block 104 is performed.

As will be appreciated from the following description, to accurately identify the
sound(s) contained within the time segment, additional identifying characteristics must be
culled from the signal. Such characteristics relate to the amplitude and frequency of each of
the various component signals that make up the complex waveform contained within the
time slice. This information is obtained by successively filtering the time slice into its
various component signals. Previously, this type of "decomposition" was usually
accomplished by performing a Fast Fourier Transform on the sound signal. However, this
standard approach is not adequate for evaluating user-independent speech in real time.
For many sounds, accurate identification of the individual component frequencies is very
difficult, if not impossible, due to the spectral leakage that is inherently present in the
FFT's output. Also, because the formant signals contained in speech signals are amplitude
modulated due to the glottal spectrum dampening, and because most speech signals are
non-periodic, the FFT is by definition an inadequate tool. However, such
information is critical to accomplish user-independent speech recognition with the
required level of confidence.

To avoid this problem, in the preferred embodiment of the Decompose a Speech 
Signal algorithm, a FFT is not performed. Instead, the DSP 36 filters the time slice signal 
into various component filtered signals. As will be described in further detail, frequency 
domain data can be extracted directly from each of these filtered signals. This data can 
then be used to determine the characteristics of the specific phoneme contained within the 
time slice. 

By way of example and not limitation, the detailed program steps used to perform
this particular function are shown in the flow chart illustrated in Figure 6. Referring first
to program step 152, the current time segment (10,240 data samples; 232 milliseconds in
duration) is received. The program then enters a loop, beginning with step 154, wherein
the speech signal contained within the current time segment is successively filtered into
its individual component waveforms by using a set of digital bandpass filters having
specific frequency bands. In the preferred embodiment, these frequency bands are
precalculated, and stored in DSP program memory 40. At step 154, the processor 36
obtains the first filter band, designated as a low frequency (f_L) and a high frequency
(f_H), from this table of predetermined filter cutoff frequencies. In the preferred embodiment,
the filter cutoff frequencies are located at: 0 Hz, 250 Hz, 500 Hz, 1000 Hz, 1500 Hz,
2000 Hz, 2500 Hz, 3000 Hz, 3500 Hz, 4000 Hz, 4500 Hz, 5000 Hz, 6000 Hz, 7000 Hz,
8000 Hz, 9000 Hz, and 10,000 Hz. It will be appreciated that different or additional
cutoff frequencies could also be used.

Thus, during the first pass through the loop beginning at step 154, f_L will be set
to 0 Hz and f_H to 250 Hz.

Having set the appropriate digital filter parameters, the processor 36 proceeds
to step 158, where the actual filtering of the time segment occurs. To do so, this step
invokes another function referred to as "Do Filter Pass," which is shown in further detail
in Figure 6A and to which reference is now made.

In Do Filter Pass, the previously set filter parameters, as well as the time segment
data (10,240 data points), are received. At step 170, the
coefficients for the filter are obtained from a predetermined table of coefficients that
correspond to each of the different filter bands. Alternatively, the coefficients could be
recalculated by the processor 36 for each new filter band.

Having set the filter coefficients, the processor 36 executes program step 172,
where the current time segment is loaded into the digital filter. Optionally, rather than
loading all data samples, the signal may be decimated and only every nth input loaded,
where n is in the range of one to four. Before the signal is decimated, it should be low
pass filtered down to a frequency less than or equal to the original sample rate divided by
2*n. At step 174, the filtering operation is performed on the current time segment data.
The results of the filtering operation are written into corresponding time segment data
locations within DSP data memory 42. Although any one of a variety of different digital
filter implementations could be used to filter the data, in the preferred embodiment the
digital bandpass filter is an IIR cascade-type filter with a Butterworth response.

Once the filtering operation is complete for the current filter band, the processor
36 proceeds to step 176, where the results of the filtering operation are evaluated. This
is performed by the function referred to as "Evaluate Filtered Data," which is shown in
further detail in Figure 6B, to which reference is now made.

At step 182 of Evaluate Filtered Data, a time slice of the previously filtered time
segment is received. Proceeding next to step 183, the amplitude of this filtered signal is
calculated. The amplitude is calculated using the following equation:

$$\text{Amplitude} = \max - \min$$

where max = the highest amplitude value in the time slice; and min = the
lowest amplitude value in the time slice.

At step 184 the frequency of the filtered signal is measured. This is performed by
a function called "Measure Frequency of a Filtered Signal," which is shown in further
detail in Figure 6C. Referring to that figure, at step 192 the filtered time slice data is
received. At step 194, the processor 36 calculates the slope (i.e., the first derivative) of
the filtered signal at each data point. This slope is calculated with reference to the line
formed by the previous data point, the data point for which the slope is being calculated,
and the data point following it, although other methods could also be used.

Proceeding next to step 196, each of the data point locations corresponding to a
slope changing from a positive value to a negative value is located. Zero crossings are
determined beginning at the maximum amplitude value in the filtered signal and
proceeding for at least three zero crossings. The maximum amplitude value represents the
closure of the vocal folds. Taking this frequency measurement after the closure of the vocal
folds ensures the most accurate frequency measurement. At step 198 the average distance
between these zero crossing points is calculated. This average distance is the average
period size of the signal, and thus the average frequency of the signal contained within this
particular time slice can be calculated by dividing the sample rate by this average period.
At step 200, the frequency of the signal and the average period size are returned to the
calling function "Evaluate Filtered Data." Processing then continues at step 184 in Figure
6B.
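
The period-based frequency measurement of steps 192 through 200 can be illustrated as follows. This is a sketch, not the routine of Figure 6C: the name `measure_frequency` is hypothetical, the slope is assumed to be a central difference over the neighbouring points, and the "zero crossings" are taken to be the points where that slope changes from positive to negative.

```python
import math

def measure_frequency(samples, sample_rate=44100.0, min_points=3):
    """Return (frequency_hz, period_in_samples), or None if too few points."""
    n = len(samples)
    # Step 194: slope at each interior point from its two neighbours.
    slope = [0.0] * n
    for i in range(1, n - 1):
        slope[i] = (samples[i + 1] - samples[i - 1]) / 2.0
    # Step 196: start at the maximum amplitude value, then collect every
    # point where the slope changes from positive to negative.
    start = max(range(n), key=lambda i: samples[i])
    marks = [i for i in range(max(start, 1), n - 1)
             if slope[i - 1] > 0 and slope[i] <= 0]
    if len(marks) < min_points:
        return None
    # Step 198: the average spacing is the average period in samples.
    period = (marks[-1] - marks[0]) / (len(marks) - 1)
    return sample_rate / period, period

# Usage: a 1000 Hz test tone sampled at 44.1 kHz, one 256-point slice.
tone = [math.sin(2 * math.pi * 1000 * k / 44100) for k in range(256)]
freq, period = measure_frequency(tone)
```

Because the period is measured in whole samples, the result for the test tone is close to, but not exactly, 1000 Hz.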

Referring again to that figure, once the frequency of the signal has been
determined, at step 186 it is determined whether that frequency falls within the cutoff
frequencies of the current filter band. If so, step 188 is executed, wherein the frequency
and the amplitude are stored in the "ffreq" and the "ampl" arrays of the time slice's
corresponding Higgins data structure. If the frequency does not fall within the cutoff
frequencies of the current filter band, then the frequency is discarded and step 190 is
executed, thereby causing the DSP 36 to return to the calling function "Do Filter Pass."
Processing then continues at step 176 in Figure 6A.

As is shown in Figure 6A, once the "Evaluate Filtered Data" function has been
performed, and the frequency and amplitude of the current frequency band have been
determined, the DSP 36 proceeds next to program step 178. That step checks whether the
last time slice has been processed. If not, then the program continues in the loop, and
proceeds to program step 176 to again operate the current band filter on the next time
slice, as previously described. If the last time slice has been filtered, then step 180 is
performed and the processor 36 returns to the "Decompose a Speech Signal" function,
where processing continues at step 158 in Figure 6.

As is further illustrated in Figure 6, the processor determines at step 159 whether
the first filter band has just been used for this time segment. If so, the next step in the
process is shown at program step 162. There, a function referred to as "Get Fundamental
Frequency" is performed, which is shown in further detail in the corresponding figure, to
which reference is now made.

Beginning at step 202 of that function, the data associated with the current time
segment is received. Next, the processor 36 proceeds to program step 204 and identifies,
by querying the contents of the respective ffreq array locations, which of the time slices
have frequency components that are less than 350 Hz. This range of frequencies (0
through 350 Hz) was chosen because the fundamental frequency for most speakers falls
somewhere within the range of 70 to 350 Hz. Limiting the search to this range ensures
that only fundamental frequencies will be located. When a time slice is located that does
have a frequency that falls within this range, it is placed in a histogram type data
structure. The histogram is broken up into "bins," which correspond to 50 Hz blocks
within the 0 to 350 Hz range.

Once this histogram has been built, the DSP 36 proceeds to step 206 and
determines which bin has the greatest number of frequencies placed therein. The
frequencies contained within that particular bin are then averaged, and the
result is the Average Fundamental Frequency (F_0) for this particular time segment. This
value is then stored in DSP data memory 42.
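
The histogram step can be pictured with a short example. It is a sketch only: the name `average_fundamental` is hypothetical, and it assumes that the bin of interest is the one holding the most frequencies and that one candidate frequency per time slice is supplied.

```python
def average_fundamental(freqs, bin_width=50.0, f_max=350.0):
    """Average the frequencies in the fullest 50 Hz bin below 350 Hz."""
    bins = {}
    for f in freqs:
        if 0 <= f < f_max:                       # step 204: keep 0 to 350 Hz
            bins.setdefault(int(f // bin_width), []).append(f)
    if not bins:
        return None                              # no candidate in range
    fullest = max(bins.values(), key=len)        # step 206: most populated bin
    return sum(fullest) / len(fullest)
```

With candidates of 118, 122, 120, 240 and 60 Hz, the 100 to 150 Hz bin holds three values, so the result is their mean, 120 Hz.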

At step 208, the DSP 36 calculates the "moving" average of the average
fundamental frequency, which is calculated to be equal to the average of the F_0 values
calculated for the previous time segments. In the preferred embodiment, this moving
average is calculated by keeping a running average of the previous eight time segment
average fundamental frequencies, which corresponds to about two seconds of speech.
This moving average can be used by the processor 36 to monitor trends in the speaker's
voice, such as a change in volume and pitch, or even a change in speaker.
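
A running average over the last eight segments is conveniently kept with a fixed-length queue. The class name `MovingF0` below is hypothetical and the code is only one possible way to keep such an average; eight segments of 232 milliseconds each cover roughly 1.9 seconds, consistent with the "about two seconds" stated above.

```python
from collections import deque

class MovingF0:
    """Running average of the most recent per-segment F_0 values."""

    def __init__(self, window=8):
        # deque(maxlen=...) silently drops the oldest value when full.
        self._recent = deque(maxlen=window)

    def update(self, f0):
        """Add the newest segment's F_0 and return the current average."""
        self._recent.append(f0)
        return sum(self._recent) / len(self._recent)
```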

Once the average fundamental frequency for the time segment and the moving
average of the fundamental frequency have been calculated, the processor 36 then enters
a loop to determine whether the individual time slices that make up the current time
segment have a fundamental frequency f_0 component. This determination is made at step
210, wherein the processor 36, beginning with the first time slice, compares the time
slice's various frequency components (previously identified and stored within the ffreq
array in the corresponding data structure) to the average fundamental frequency F_0
identified in step 206. If one of the frequencies is within about 30% of that value, then
that frequency is deemed to be a fundamental frequency of the time slice, and it is stored
as a fundamental in the time slice Higgins data structure, as is indicated at program step
214. As is shown at step 212, this comparison is done for each time slice. At step 216,
after each time slice has been checked, the DSP 36 returns to the Decompose a Speech
Signal routine, and continues processing at step 162 in Figure 6.
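
The 30% comparison of step 210 reduces to a simple tolerance check. In this sketch the name `find_fundamental` is hypothetical, and it is assumed that the first frequency inside the tolerance is the one kept; the text does not say how ties are resolved.

```python
def find_fundamental(ffreq, f0_avg, tolerance=0.30):
    """Return the first frequency within 30% of the segment F_0, else None."""
    for f in ffreq:
        if abs(f - f0_avg) <= tolerance * f0_avg:
            return f
    return None
```

A slice with components at 130, 480 and 990 Hz and a segment average of 120 Hz would therefore be assigned 130 Hz as its fundamental.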

At step 160 in that figure, the processor 36 checks if the last pair of cutoff
frequencies (f_L and f_H) has yet been used. If not, the processor 36 continues the loop at
step 154, and obtains the next set of cutoff frequencies for the next filter band. The DSP
36 then continues the filtering process as described above until the last of the filter bands
has been used to filter each time slice. Thus, each time segment will be filtered at each of
the filter bands. When complete, the Higgins data structure for each time slice will have
been updated with a clear identification of the frequency, and its amplitude,
contained within each of the various filter bands. Advantageously, the frequency data has
thus far been obtained without utilizing an FFT approach, and the problems associated
with that tool have thus been avoided.

Once the final pair of cutoff frequencies has been used at step 160, step 166
causes the DSP 36 to execute a return to the main program illustrated in Figure 4. Having
completed the Decompose a Speech Signal portion of the program method, there exists
a Higgins data structure for each time slice. Contained within that structure are various
sound characteristics culled from both time domain data and frequency domain data.
These characteristics can now be utilized to identify the particular sound, or phoneme,
carried by the signal. In the preferred embodiment, the series of program steps used to
implement this portion of the program method are stored within the host program memory
58, and are executed by the Host Sound Processor 54.

The next function performed is illustrated in the block labeled "Points of Maximum
Intelligence," shown at item 106 in Figure 4. In this function, the processor 54 evaluates
which of the Higgins data structures are critical to the identification of the phoneme
sounds contained within the time segment. This reduces the amount of processing needed
to identify a phoneme and ensures that phonemes are accurately identified.

One example of the detailed program steps used to implement this function is
shown in Figure 7, to which reference is now made. The process begins when the host
sound processor 54 receives each of the Higgins data structures for the current time
segment via the host interface 52, and stores them within host data memory 60. At step
232, for all time slices containing a voiced sound, the absolute value of the slope of each
filtered signal is calculated and summed. The sum of each filtered signal's slope for a
time slice is then stored in the sumSlope variable of each applicable Higgins data
structure.

The host processor 54 then proceeds to program step 234. At this step a search
is conducted for those time slices which have a sumSlope value going through a minimum
and which also have an average amplitude L_S that goes through a maximum. These are
the time slices where the frequencies are
changing the least (i.e., minimum slope) and where the sound is at its highest average
amplitude (i.e., highest L_S), and are thus determined to be the point at which the dynamic
sound has most closely reached a static or target sound. Those time slices that satisfy
both criteria are identified as "points of maximum intelligence," and the corresponding
PMI variable within the Higgins data structure is filled with a PMI value. Other time
slices contain frequency components that are merely leading up to this target sound, and
thus contain information that is less relevant to the identification of the particular
phoneme.

Having identified which "voiced" time slices should be considered "points of
maximum intelligence," the same is done for all time slices containing an "unvoiced"
sound. This is accomplished at step 236, where each unvoiced time slice having an
average amplitude L_S that goes through a maximum is identified as a "point of maximum
intelligence." Again, the corresponding PMI variable within the appropriate Higgins data
structure is filled with a PMI value.

The host processor 54 then proceeds to program step 238, wherein the "duration"
of each time slice identified as a PMI point is determined by calculating the number of
time slices that have occurred since the last PMI time slice occurred. This duration value
is the actual PMI value that is placed within each time slice data structure that has been
identified as being a "point of maximum intelligence." The host processor 54 then returns,
as is indicated at step 240, to the main calling routine shown in Figure 4.
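
The selection of PMI points in steps 234 through 238 can be sketched as a search for local extrema. The following is illustrative only: the name `mark_pmi` is hypothetical, "goes through a minimum (or maximum)" is assumed to mean a local extremum relative to the two neighbouring slices, a value of 0 is assumed to mean "not a PMI point," and the first PMI point is assumed to count its duration from the start of the segment.

```python
def _is_local_max(values, i):
    return 0 < i < len(values) - 1 and values[i - 1] < values[i] >= values[i + 1]

def _is_local_min(values, i):
    return 0 < i < len(values) - 1 and values[i - 1] > values[i] <= values[i + 1]

def mark_pmi(types, l_s, sum_slope):
    """Return one PMI value per slice (0 where the slice is not a PMI point)."""
    pmi = [0] * len(types)
    last = -1                                    # index of the previous PMI point
    for i, kind in enumerate(types):
        if kind == "V":                          # step 234: voiced slices
            hit = _is_local_min(sum_slope, i) and _is_local_max(l_s, i)
        elif kind == "U":                        # step 236: unvoiced slices
            hit = _is_local_max(l_s, i)
        else:
            hit = False
        if hit:                                  # step 238: duration since last PMI
            pmi[i] = i - last
            last = i
    return pmi
```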

Referring again to that figure, the next functional block performed is the
"Evaluate" function, shown at 108. This function analyzes the sound characteristics of
each of the time slices identified as points of maximum intelligence, and determines the
most likely sounds that occur during these time slices. This is generally accomplished by
comparing the measured sound characteristics (i.e., the contents of the Higgins structure)
to a set of standard sound characteristics. The sound standards have been compiled by
conducting tests on a cross-section of individual speakers' sound patterns,
identifying the characteristics of each of the sounds, and then formulating a table of
standard sound characteristics for each of the forty or so phonemes which make up the
given language.

Referring to Figure 8, one example of the detailed program steps used to
implement the Evaluate function is illustrated. Beginning at program step 242, each of
the time slices identified as PMI points is received. At step 244, the host processor 54
executes a function referred to as "Calculate Harmonic Formant Standards."

The Calculate Harmonic Formant Standards function operates on the premise that
the location of frequencies within any particular sound can be represented in terms of
"half-steps." The term half-steps is typically used in the musical context, but it is also
helpful in the analysis of sounds. On a musical or chromatic scale, the frequency of the
notes doubles every octave. Since there are twelve notes within an octave, the frequencies
of two notes are related by the formula:

$$\text{UPPER NOTE} = (\text{LOWER NOTE}) \times 2^{n/12},$$

where n is the number of half-steps.

Given two frequencies (or notes), the number of half-steps between them is given
by the equation:

$$n = \frac{12 \, \log\left(\dfrac{\text{Upper Frequency}}{\text{Lower Frequency}}\right)}{\log(2)}$$



Thus, the various frequencies within a particular sound can be thought of in terms
of a musical scale by calculating the distance between each component frequency and the
fundamental frequency in terms of half-steps. This notion is important because it has been
found that for any given sound, the distance (i.e., the number of half-steps) between the
fundamental frequency and the other component frequencies of the sound is very similar
for all speakers - men, women and children.
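
As a worked illustration of the half-step relationship, the two formulas above are inverses of one another. The function names below are hypothetical and are used only for this example.

```python
import math

def half_steps(upper_hz, lower_hz):
    """Number of half-steps between two frequencies (12 per octave)."""
    return 12.0 * math.log(upper_hz / lower_hz) / math.log(2.0)

def upper_note(lower_hz, n):
    """Frequency that lies n half-steps above lower_hz."""
    return lower_hz * 2.0 ** (n / 12.0)
```

Doubling a frequency (one octave) gives twelve half-steps, so a component at 880 Hz lies twelve half-steps above a 440 Hz fundamental regardless of who the speaker is.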

The Calculate Harmonic Formant Standards function makes use of this
phenomenon by building a "standard" musical table for all sounds. Specifically, this table
includes the relative location of each of the sound's frequency components in terms of
their distance from a fundamental frequency, wherein the distance is designated as a
number of half-steps. This is done for each phoneme sound. This standard musical table
is derived from the signal characteristics that are present in each sound type (phoneme),
which are obtained via sample data taken from a cross-section of speakers.

Specifically, voice samples were taken from a representative group of speakers
whose fundamental frequencies cover a range of about 70 Hz to about 350 Hz. The voice
samples are specifically chosen so that they include all of the forty or so phoneme sounds
that make up the English language. Next, the time domain signal for each phoneme sound
is evaluated, and all of the frequency components are extracted in the manner previously
described in the Decompose function using the same frequency bands. Similarly, the
amplitudes for each frequency component are also measured. From this data, the number
of half-steps between the particular phoneme sound's fundamental frequency and each of
the sound's component frequencies is determined. This is done for all phoneme sound
types. A separate x-y plot can then be prepared for each of the frequency bands for each
sound. Each speaker's sample points are plotted, with the speaker's fundamental
frequency (in half-steps) on the x-axis, and the distance between the measured band
frequency and the fundamental frequency (in half-steps) on the y-axis. A linear regression
is then performed on the resulting data, and a resulting "best fit line" drawn through the
data points. An example of such a plot is shown in Figures 12A-12C, which illustrate the
representative data points for the sound "Ah" (PASCII sound 024), for the first three
frequency bands (shown as B1, B2 and B3).
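
The "best fit line" for one frequency band of one sound can be obtained with an ordinary least-squares fit. The sketch below assumes that this is the kind of linear regression intended; the name `fit_line` is hypothetical, and the x and y lists stand for the per-speaker sample points described above.

```python
def fit_line(xs, ys):
    """Least-squares slope m and y-intercept b for the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # requires at least two distinct x values
    b = mean_y - m * mean_x
    return m, b
```

Each (m, b) pair produced this way corresponds to one slope and y-intercept entry for a given sound and frequency band.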

Graphs of this type are prepared for all of the phoneme sound types, and the slope
and the y-intercept equations for each frequency band for each sound are derived. The
results are placed in a tabular format, one preferred example of which is shown in TABLE
II in Appendix A. As is shown, this table contains a phoneme sound (indicated as a
PASCII value) and, for each of the bandpass frequencies, the slope (m) and the y-intercept
(b) of the resulting linear regression line. Also included in the table is the mean of the
signal amplitudes for all speakers, divided by the corresponding L_S value, at each
particular frequency band. Alternatively, the median amplitude value may be used instead.

As can be seen from the graphs in Figures 12A-12C, the data points for each of the
speakers in the test group are tightly grouped about the regression line, regardless of the
speaker's fundamental frequency. This same pattern exists for almost all other sounds as
well. Further, the pattern extends to speakers other than those used to generate the
sample data. In fact, if the fundamental frequency and the frequency band locations (in
half-steps) are known for any given sound generated by any given user, the corresponding
sound type (phoneme) can be determined by comparison to these standard values.

The Calculate Harmonic Formant Standards function utilizes this standard sound
equations data (TABLE II) to build a representative musical table containing the standard
half-step distances for each sound. Importantly, it builds this standards table so that it is
correlated to a specific fundamental frequency, and specifically, it uses the fundamental
frequency of the time slice currently being evaluated. The function also builds a musical
table for the current time slice's measured data (i.e., the Higgins structure and ffreq
data). The time slice "measured" data is then compared to the sound "standard" data, and
the closest match indicates the likely sound type (phoneme). Since what is being
compared is essentially the relative half-step distances between the various frequency
components and the fundamental frequency - which for any given sound are consistent
for every speaker - the technique ensures that the sound is recognized independently of
the particular speaker.

One example of the detailed program steps used to accomplish the "Calculate
Harmonic Formant Standards" function is shown in Figure 8A, to which reference is now
made. Beginning at program step 280, the Higgins structure for the current time slice is
received. Step 282 then converts that time slice into a musical scale. This is done by
calculating the number of half-steps each frequency component (identified in the
"Decompose" function and stored in the ffreq array) is located from the fundamental
frequency. These distances are calculated with the following equation:



12 . log 



fa 



60(log(2)) 



where N = 1 through 15, corresponding to each of the different frequencies 
calculated in the Decompose function and stored in the ffreq array for this time 
slice; and f_0 = the fundamental frequency for this time slice, also stored in the 
Higgins data structure. The value 60 is used to normalize the number of half-steps 
to an approximate maximum number of half-steps that occur. 

The results of the calculation are stored by the host processor 54 as an array in the host 
processor data memory 60. 
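
The conversion performed at step 282 can be sketched in Python as follows. This is a minimal illustration, assuming the distance is 12 times the base-2 logarithm of the ratio between each frequency component and the fundamental, divided by the normalizing value 60. The function name `half_step_distances` and the requirement that every frequency be positive are assumptions of this sketch, not details taken from the specification.

```python
import math

NORMALIZER = 60  # approximate maximum number of half-steps, per the text


def half_step_distances(ffreq, f0):
    """Return the normalized half-step distance of each frequency from f0.

    ffreq: iterable of positive frequencies in Hz (assumed, up to 15 entries).
    f0:    positive fundamental frequency in Hz for the same time slice.
    """
    return [12 * math.log(f / f0) / (NORMALIZER * math.log(2)) for f in ffreq]
```

Under these assumptions, a component one octave above the fundamental lies twelve half-steps away, so its normalized distance is 12/60.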

Having converted the time slice to the musical scale, the processor 54 next enters 
a loop to begin building the corresponding sound standards table, so it too is represented 
in the musical scale. Again, this is accomplished with the standard equations data 
(TABLE II), which is also stored as an array in host data memory 60. 

Beginning at step 284, the host processor 54 obtains the standard equations data 
for a sound, and queries whether the current time slice contains a voiced sound. If not, 
the processor 54 proceeds to program step 290, where it calculates the number of half- 
steps each frequency component (for each of the frequency bands previously identified) 
is located from the fundamental frequency. The new "standards" are calculated relative 
to the fundamental frequency of the current time slice. The formula used to calculate 
these distances is: 

$$\text{half-steps} = \frac{m \cdot f_0 + b}{60}$$

where m = the slope of the standard equation line previously identified; b = the 
y-intercept of the standard equation line previously identified; f_0 = fundamental frequency 
of the current time slice; and the value 60 is used to normalize the number of half-steps 
to an approximate maximum number of half-steps that occur. 

This calculation is completed for all 15 of the frequency bands. Note that 
unvoiced sounds do not have a "fundamental" frequency stored in the data structure's f_0 
variable. For purposes of program step 290, the frequency value identified in the first 
frequency band (i.e., contained in the first location of the ffreq array) is used as a 
"fundamental." 

If at step 284 it is determined that the current time slice is voiced, the host sound 
processor 54 proceeds to program step 286 and queries whether the current standard 
sound is a fricative. If it is a fricative sound, then the processor 54 proceeds to step 290 
to calculate the standards for all of the frequency bands (one through fifteen) in the 
manner described above. 

If the current sound is not a fricative, the host processor 54 proceeds to step 288. 
At that step, the standards are calculated in the same manner as step 290, but only for the 
frequency bands 1 through 11. 

After the completion of program step 288 or step 290, the processor 54 proceeds 
to step 292, where it queries whether the final standard sound in the table has been 
processed for this time slice. If not, the next sound and its associated slope and intercept 
data are obtained, and the loop beginning at step 284 is re-executed. If no sounds remain, 
then the new table of standard values, expressed in terms of the musical scale, is complete 
for the current time slice (which has also been converted to the musical scale). The host 
processor 54 exits the routine at step 294, and returns to the Evaluate function at step 244 
in Figure 8. 
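
The loop of steps 284 through 292 can be sketched as below. This is a minimal sketch, assuming each standard is a straight line (slope m, intercept b) per frequency band and that the result is normalized by 60 as the text describes. The dictionary layout, the `fricative` flag and the function name `build_standards_table` are hypothetical. For an unvoiced slice the caller is assumed to pass the first ffreq entry as `f0`.

```python
NORMALIZER = 60           # normalizing value named in the text
ALL_BANDS = 15            # step 290: every frequency band
NON_FRICATIVE_BANDS = 11  # step 288: bands 1 through 11 only


def build_standards_table(standards, f0, slice_is_voiced):
    """Express every standard sound in half-steps for the current time slice.

    standards: {sound: {"fricative": bool, "lines": [(m, b), ...]}} with one
               (slope, intercept) pair per band (hypothetical layout).
    """
    table = {}
    for sound, entry in standards.items():
        if not slice_is_voiced or entry["fricative"]:
            bands = ALL_BANDS            # unvoiced slice or fricative standard
        else:
            bands = NON_FRICATIVE_BANDS  # voiced slice, non-fricative standard
        table[sound] = [(m * f0 + b) / NORMALIZER
                        for m, b in entry["lines"][:bands]]
    return table
```

Because the table is rebuilt from `f0` on every call, the standards always refer to the fundamental of the slice being evaluated, which is the property the text relies on for speaker independence.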

Referring again to that figure, the host processor 54 next executes program step 
250 to query whether the current time slice is voiced. If not, the processor 54 executes 
program step 246, which executes a function referred to as "Multivariate Pattern 
Recognition." This function merely compares "standard" sound data with "measured" 
time slice data, and evaluates how closely the two sets of data correspond. In the 
preferred embodiment, the function is used to compare the frequency (expressed in half- 
steps) and amplitude components of each of the standard sounds to the frequency (also 
expressed in half-steps) and amplitude components of the current time slice. A close 
match indicates that the time slice contains that particular sound (phoneme). 

One example of the currently preferred set of program steps used to implement the 
"Multivariate Pattern Recognition" function is shown in the program flow chart of Figure 
8B, to which reference is now made. Beginning at step 260, an array containing the 
standard sound frequency component locations and their respective amplitudes, and an 
array containing the current time slice frequency component locations and their respective 
amplitudes, are received. Note that the frequency locations are expressed in terms of half- 
step distances from a fundamental frequency, calculated in the "Calculate Harmonic 
Formant Standards" function. The standard amplitude values are obtained from the test 
data previously described, examples of which are shown in TABLE II, and the amplitude 
components for each time slice are contained in the Higgins structure "amplitude" array, 
as previously described. 

At step 262, the first sound standard contained in the standards array is compared 
to the corresponding time slice data. Specifically, each time slice frequency and amplitude 
"data point" is compared to each of the current sound standard frequency and amplitude 
"data points." The data points that match the closest are then determined. 

Next, at program step 264, for the data points that match most closely, the 
Euclidean distance between the time slice data and the corresponding standard data is 
calculated. The Euclidean distance (ED) is calculated with the following equation: 



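$$ED = \sqrt{\sum_{i=1}^{n} \left[ \left( \text{standard}(f_i) - \text{measured}(f_i) \right)^2 + \left( \text{standard}(a_i) - \text{measured}(a_i) \right)^2 \right]}$$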



where n = the number of data points compared; "f" indicates frequency; and "a" 
indicates amplitude. 

At program step 266, this distance is compared to the distances found for other 
sound standards. If it is one of the five smallest found thus far, the corresponding 
standard sound is saved in the Higgins structure in the POSSIBLE PHONEMES array at 
step 268. The processor then proceeds to step 270 to check if this was the last sound 
standard within the array and, if not, the next standard is obtained at program step 272. 
The same comparison loop is then performed for the next standard sound. If at step 266 
it is found that the calculated Euclidean distance is not one of the five smallest distances 
already found, then the processor 54 discards that sound as a possibility, and proceeds to 
step 270 to check if this was the final standard sound within the array. If not, the next 
sound standard is obtained at program step 272, and the comparison loop is re-executed. 
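
The comparison loop of steps 262 through 272 can be sketched as follows. This is a sketch under stated assumptions: the data points are taken to be already paired index by index (the pairing of step 262 is not shown), and the distance is assumed to sum the squared frequency and amplitude differences of each pair. The names `euclidean_distance` and `closest_sounds` are hypothetical. Sorting all distances and keeping the first five gives the same result as tracking the five smallest inside the loop.

```python
import math

MAX_CANDIDATES = 5  # the text keeps the five smallest distances


def euclidean_distance(measured, standard):
    """Distance between two equal-length lists of (half_steps, amplitude)."""
    return math.sqrt(sum((f_m - f_s) ** 2 + (a_m - a_s) ** 2
                         for (f_m, a_m), (f_s, a_s) in zip(measured, standard)))


def closest_sounds(measured, standards):
    """Return up to five (distance, sound) pairs, smallest distance first.

    standards: {sound: [(half_steps, amplitude), ...]} (hypothetical layout).
    """
    scored = sorted((euclidean_distance(measured, points), sound)
                    for sound, points in standards.items())
    return scored[:MAX_CANDIDATES]
```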

This loop continues to compare the current time slice data to standard sound data 
until it is determined at step 270 that there are no remaining sound standards for this 
particular time slice. At that point, step 274 is performed, where each of the sound 
possibilities previously identified (up to five) are prioritized in descending order of 
probability. The prioritization is based on the following equation: 



$$\text{Probability} = \frac{(SM - ED) / SM}{\text{number of values} - 1}$$

where ED = Euclidean Distance calculated for this sound; SM = the sum 
of all EDs of identified sound possibilities. 

The higher the probability value, the more likely that the corresponding sound is 
the sound contained within the time slice. Once the probabilities for each possible sound 
have been determined, the processor 54 proceeds to step 276, and returns to the calling 
routine Evaluate at step 246 in Figure 8. The Higgins structure now contains an array of 
the most probable phonemes (up to five) corresponding to this particular time slice. Host 
Processor 54 then performs step 248 to determine if there is another time slice to evaluate. 
If there is, the processor 54 reenters the loop at step 242 to obtain the next time slice and 
continue processing. If no time slices remain, the processor 54 executes step 260 and 
returns to the main calling routine in Figure 4. 
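
The prioritization of step 274 can be illustrated with the sketch below. It assumes the probability takes the form (SM - ED) / (SM x (n - 1)), where n is the number of candidate sounds; this is one normalization consistent with the definitions of ED and SM, and under it the probabilities of all candidates sum to one. The handling of a single candidate or of a zero sum is an assumption added to keep the sketch safe, and the name `prioritize` is hypothetical.

```python
def prioritize(candidates):
    """Rank candidate sounds by probability, most probable first.

    candidates: list of (euclidean_distance, sound) pairs, at most five.
    Returns a list of (sound, probability) pairs.
    """
    n = len(candidates)
    if n == 0:
        return []
    total = sum(distance for distance, _ in candidates)  # SM in the text
    if n == 1 or total == 0:
        # Cases the assumed formula does not cover: share probability evenly.
        return [(sound, 1.0 / n) for _, sound in candidates]
    scored = [(sound, (total - distance) / (total * (n - 1)))
              for distance, sound in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A smaller Euclidean distance therefore yields a larger probability, which matches the ordering the text describes.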

If at step 250, it was instead determined that the current time slice contained a 
voiced sound, then the host sound processor 54 proceeds to program step 252. At this 
step, the host processor 54 determines whether the sound carried in the time slice is a 
voiced fricative, or if it is another type of voiced sound. This determination is made by 
inspecting the Relative Amplitude (RA) value and the frequency values contained in the 
ffreq array. If RA is relatively low, which in the preferred embodiment is any value less 
than about 65, and if there are any frequency components that are relatively high, which 
in the preferred embodiment is any frequency above about 6 kHz, then the sound is 
deemed a voiced fricative, and the host processor 54 proceeds to program step 254. 
Otherwise, the processor 54 proceeds to program step 256. 
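
The test made at step 252 reduces to two threshold comparisons, sketched below. The thresholds are the approximate values given for the preferred embodiment; the function name `is_voiced_fricative` and the assumption that ffreq holds frequencies in Hz are illustrative only.

```python
RA_THRESHOLD = 65      # "about 65" in the preferred embodiment
HIGH_FREQ_HZ = 6000.0  # "about 6 kHz" in the preferred embodiment


def is_voiced_fricative(relative_amplitude, ffreq):
    """True when RA is low and at least one frequency component is high."""
    return (relative_amplitude < RA_THRESHOLD
            and any(f > HIGH_FREQ_HZ for f in ffreq))
```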

Program steps 254 and 256 both invoke the "Multivariate Pattern Recognition" 
routine, and both return a Higgins structure containing up to five possible sounds, as 
previously described. After completing program step 254, the host processor 54 will get 
the next time slice, as is indicated at step 248. 
However, when program step 256 is completed, the host processor 54 will execute 
program step 258, which corresponds to a function referred to as "Adjust for Relative 
Amplitude." This function assigns new probability levels to each of the possible sounds 
previously identified by the "Multivariate Pattern Recognition" routine and stored in the 
Higgins data structure. This adjustment in probability is based on yet another comparison 
between the time slice data and standard sound data. One example of the presently 
preferred program steps needed to implement this function is shown in Figure 8C, to 
which reference is now made. 

Beginning at program step 300, the relative amplitude (RA) for the time slice is 
calculated using the following formula: 

$$RA = \frac{L_s}{\text{MaxAmpl}}$$

where L_s is the absolute average of the amplitude for this time slice stored 
in the Higgins Structure; and MaxAmpl is the "moving average" over the previous 
2 seconds of the maximum for each time segment (10,240 data points) of data. 

The host processor 54 then proceeds to program step 304 and calculates the 
difference between the relative amplitude calculated in step 300, and the standard 
relative amplitude for each of the probable sounds contained in the Higgins data structure. 
The standard amplitude data is comprised of average amplitudes obtained from a 
representative cross-sample of speakers, an example of which is shown in TABLE III in 
the appendix. 

Next, at program step 306 the differences are ranked, with the smallest difference 
having the largest rank, and the largest difference having the smallest rank of one. 
Proceeding next to program step 308, new probability values for each of the probable 
sounds are calculated by averaging the previous confidence level with the new percent 
rank calculated in step 306. At program step 310, the probable sounds are then re-sorted, 
from most probable to least probable, based on the new confidence values calculated in 
step 308. At step 312, the host processor 54 returns to the calling routine "Evaluate" at 
program step 258 in Figure 8. 
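
Steps 304 through 310 can be sketched as follows. The text does not define "percent rank" precisely, so this sketch assumes it is the rank divided by the number of candidates, with rank 1 given to the largest amplitude difference. The function name and the data layout are hypothetical.

```python
def adjust_for_relative_amplitude(candidates, slice_ra, standard_ra):
    """Re-score candidate sounds using relative amplitude (RA).

    candidates:  list of (sound, probability) pairs from pattern recognition.
    slice_ra:    RA measured for the current time slice.
    standard_ra: {sound: standard RA} (hypothetical layout).
    """
    n = len(candidates)
    # Largest difference first, so position 1 is the smallest rank (step 306).
    by_difference = sorted(candidates,
                           key=lambda c: abs(slice_ra - standard_ra[c[0]]),
                           reverse=True)
    adjusted = []
    for rank, (sound, probability) in enumerate(by_difference, start=1):
        percent_rank = rank / n  # assumed definition of "percent rank"
        adjusted.append((sound, (probability + percent_rank) / 2))  # step 308
    # Step 310: most probable first.
    return sorted(adjusted, key=lambda c: c[1], reverse=True)
```

With this definition, a sound whose standard RA is closest to the measured RA can overtake a sound that scored better on the Euclidean distance alone.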

Referring again to Figure 8, having completed the Adjust for Relative Amplitude 
routine, the host sound processor proceeds to program step 248 and determines whether 
another time slice remains. If so, the processor 54 reenters the loop at step 242, and 
processes a new time slice in the same manner as described above. If not, the processor 
54 executes step 260 and returns to the main calling routine in Figure 4. 

The next step performed by the sound recognition host processor 54 is shown at 
block 110 in Figure 4 and is referred to as the "Compress Phones" function. As already 
discussed, this function discards those time slices in the current time segment that are not 
designated "points of maximum intelligence." In addition, it combines any contiguous 
time slices that represent "quiet" sounds. By eliminating the unnecessary time slices, all 
that remains are the time slices (and associated Higgins structure data) needed to identify 
the phonemes contained within the current time segment. This step further reduces 
overall processing requirements and ensures that the system is capable of performing 
sound recognition in substantially real time. 

One presently preferred example of the detailed program steps used to implement 
the "Compress Phones" function is shown in Figure 9, to which reference is now made. 
Beginning at program step 316, the host sound processor 54 receives the existing 
sequence of time slices and the associated Higgins data structures. At program step 318, 
processor 54 eliminates all Higgins structures that do not contain PMI points. Next, at 
program step 320, the processor 54 identifies contiguous data structures containing "quiet" 
sections, and reduces those contiguous sections into a single representative data structure. 
The PMI duration value in that single data structure is incremented so as to represent all 
of the contiguous "quiet" structures that were combined. 
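
The two reductions of steps 318 and 320 can be sketched as below. The field names `pmi`, `quiet` and `pmi_duration` are hypothetical stand-ins for the corresponding Higgins structure members, and adding the merged structure's duration to the surviving one is an assumption about how the PMI duration value is incremented.

```python
def compress_phones(slices):
    """Drop non-PMI time slices and merge runs of contiguous quiet slices.

    slices: list of dicts with 'pmi' (bool), 'quiet' (bool) and
            'pmi_duration' (int) keys (hypothetical field names).
    """
    compressed = []
    for current in slices:
        if not current["pmi"]:
            continue  # step 318: not a point of maximum intelligence
        if current["quiet"] and compressed and compressed[-1]["quiet"]:
            # Step 320: fold this quiet slice into the previous quiet one.
            compressed[-1]["pmi_duration"] += current["pmi_duration"]
        else:
            compressed.append(dict(current))  # copy so the input is untouched
    return compressed
```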

At this point, there exists in the host processor data memory 60 a continuous 
stream of Higgins data structures, each of which contains sound characteristic data and 
the possible phoneme(s) associated therewith. All unnecessary, irrelevant and/or 
redundant aspects of the time segment have been discarded so that the remaining data 
stream represents the "essence" of the incoming speech signal. Importantly, these 
essential characteristics have been culled from the speech signal in a manner that is not 
dependent on any one particular speaker. Further, they have been extracted in a manner 
such that the speech signal can be processed in substantially real time — that is, the input 
can be received and processed at normal rate of speech. 

Having reduced the Higgins structure data, the Compress Phones function causes 
the sound recognition host processor 54 to place that data in host data memory 60 in 
program step 324. Proceeding next to program step 326, the host sound processor 54 
returns to the main portion of the program method in Figure 4. 

As is shown in that figure, the next portion of the program method corresponds 
to the function referred to as the "Linguistic Processor." The Linguistic Processor is that 
portion of the method which further analyzes the Higgins structure data and, by applying 
a series of higher level linguistic processing techniques, identifies the word or phrase that 
is contained within the current time segment portion of the incoming speech signal. 

Although alternative linguistic processing techniques and approaches could be 
used, one presently preferred set of program steps used to implement the Linguistic 
Processor is shown in the flow chart of Figure 10. Beginning at program step 350 of that 
function, the host sound processor 54 receives the set of Higgins structure data created 
by the previously executed Compress Phones function. As already discussed, this data 
represents a stream of the possible phonemes contained in the current time segment 
portion of the incoming speech signal. At program step 352, the processor 54 passes this 
data to a function referred to as "Dictionary Lookup." 

In one preferred embodiment, the Dictionary Lookup function utilizes a phonetic- 
English dictionary that contains the English spelling of a word along with its 
corresponding phonetic representation. The dictionary can thus be used to identify the 
English word that corresponds to a particular stream of phonemes. The dictionary is 
stored in a suitable database structured format, and is placed within the dictionary portion 
of computer memory 62. The phonetic dictionary can be logically separated into several 
separate dictionaries. For instance, in the preferred embodiment, the first dictionary 
contains a database of the most commonly used English words. Another dictionary may 
include a database that contains a more comprehensive Webster-like collection of words. 
Other dictionaries may be comprised of more specialized words, and may vary depending 
on the particular application. For instance, there may be a user defined dictionary, a 
medical dictionary, a legal dictionary, and so on. 

All languages can be described in terms of a particular set of phonetic sounds. 
Thus, it will be appreciated that although the preferred embodiment utilizes an English 
word dictionary, any other phonetic to non-English language dictionary could be used. 

Basically, Dictionary Lookup scans the appropriate dictionary to determine if the 
incoming sequence of sounds (as identified by the Higgins data structures) forms a 
complete word, or the beginnings of a possible word. To do so, the sounds are placed 
into paths or "sequences" to help detect, by way of the phonetic dictionary, the beginning 
or end of possible words. Thus, as each phoneme sound is received, it is added to the end 
of all non-completed "sequences." Each sequence is compared to the contents of the 
dictionary to determine if it leads to a possible word. When a valid word (or set of 
possible words) is identified, it is passed to the next functional block within the Linguistic 
Processor portion of the program for further analysis. 

By way of example and not limitation, Figure 10A illustrates one presently 
preferred set of program steps used to implement the Dictionary Lookup function. The 
function begins at program step 380, where it receives the current set of Higgins 
structures, corresponding to the current time segment of speech. At program step 384, 
the host sound processor 54 obtains a phoneme sound (as represented in a Higgins 
structure) and proceeds to program step 386 where it positions a search pointer within the 
current dictionary that corresponds to the first active sequence. An "active" sequence is 
a sequence that could potentially form a word with the addition of a new sound or sounds. 
In contrast, a sequence is deemed "inactive" when it is determined that there is no 
possibility of forming a word with the addition of new sounds. 

Thus, at program step 386 the new phonetic sound is appended to the first active 
sequence. At program step 388, the host processor 54 checks, by scanning the current 
dictionary contents, whether the current sequence either forms a word, or whether it could 
potentially form a word by appending another sound(s) to it. If so, the sequence is 
updated by appending to it the new phonetic sound at program step 390. Next, at 
program step 392, the host processor determines whether the current sequence forms a 
valid word. If it does, a 'new sequence' flag is set at program step 394, which indicates 
that a new sequence should be forced beginning with the very next sound. If a valid word 
is not yet formed, the processor 54 skips step 394, and proceeds directly to program step 
396. 

If at step 388 the host processor 54 instead determines, after scanning the 
dictionary database, that the current sequence would not ever lead to a valid word even 
if additional sounds were appended, then the processor 54 proceeds to program step 398. 
At this step, this sequence is marked "inactive." The processor 54 then proceeds to 
program step 396. 

At step 396, the processor 54 checks if there are any more active sequences to 
which the current sound should be appended. If so, the processor 54 will proceed to 
program step 400 and append the sound to this next active sequence. The processor 54 
will then re-execute program step 388, and process this newly formed sequence in the 
same manner described above. 

If at program step 396 it is instead determined that there are no remaining active 
sequences, then host sound processor 54 proceeds to program step 402. There, the 'new 
sequence' flag is queried to determine if it was set at program step 394, thereby indicating 
that the previous sound had created a valid word in combination with an active sequence. 
If set, the processor will proceed to program step 406 and create a new sequence, and 
then go to program step 408. If not set, the processor 54 will instead proceed to step 
404, where it will determine whether all sequences are now inactive. If they are, 
processor 54 will proceed immediately to program step 408, and if not, the processor 54 
will instead proceed to step 406 where it will open a new sequence before proceeding to 
program step 408. 

At program step 408, the host sound processor 54 evaluates whether a primary 
word has been completed, by querying whether all of the inactive sequences, and the first 
active sequence result in a common word break. If yes, the processor 54 will output all 
of the valid words that have been identified thus far to the main calling routine portion of 
the Linguistic Processor. The processor 54 will then discard all of the inactive sequences, 
and proceed to step 384 to obtain the next Higgins structure sound. If at step 408 it is 
instead determined that a primary word has not yet been finished, the processor 54 will 
proceed directly to program step 384 to obtain the next Higgins structure sound. Once 
a new sound is obtained at step 384, the host processor 54 proceeds directly to step 386 
and continues the above described process. 
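
The core of the sequence handling (steps 386 through 398, plus the 'new sequence' flag) can be sketched as below. This is a deliberately simplified illustration: the dictionary is a plain mapping from phoneme tuples to spellings, the prefix test is a linear scan, a new empty sequence is opened only when a word completes, and the step 404 branch and the word-break test of step 408 are omitted. All names are hypothetical.

```python
def advance_sequences(sequences, phoneme, dictionary):
    """Append one phoneme to every active sequence.

    sequences:  list of phoneme tuples that are still active.
    dictionary: {phoneme_tuple: spelling} (hypothetical layout).
    Returns (still_active_sequences, words_completed_by_this_phoneme).
    """
    still_active, words = [], []
    for sequence in sequences:
        candidate = sequence + (phoneme,)
        if candidate in dictionary:
            words.append(dictionary[candidate])  # step 392: valid word
        # Step 388: keep the sequence if it is a word or a prefix of one.
        if any(key[:len(candidate)] == candidate for key in dictionary):
            still_active.append(candidate)
        # Otherwise the sequence is inactive and is dropped (step 398).
    if words:
        # 'New sequence' flag: start a fresh sequence at the next sound.
        still_active.append(())
    return still_active, words
```

Starting from a single empty sequence `[()]` and feeding phonemes one at a time, a completed word stays active as long as a longer dictionary entry begins with it, so both a word and its extensions can be reported.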

As the Dictionary Lookup function extracts words from the Higgins structure 
data, there may be certain word possibilities that have not yet been resolved. Thus, the 
Linguistic Processor may optionally include additional functions which further resolve the 
remaining word possibilities. One such optional function is referred to as the "Word 
Collocations" function, shown at block 354 in Figure 10. 

Generally, the Word Collocations function monitors the word possibilities that 
have been identified by the Dictionary Lookup function to see if they form a "common" 
word collocation. A set of these common word collocations is stored in a separate 
dictionary database within dictionary memory 64. In this way, certain word possibilities 
can be eliminated, or at least assigned lower confidence levels, because they do not fit 
within what is otherwise considered a common word collocation. One presently preferred 
example of the program steps used to implement this particular function is shown, by 
way of example and not limitation, in Figure 10B, to which reference is now made. 

Beginning at program step 420, a set of word possibilities are received. Beginning 
with one of those words at step 422, the host sound processor 54 next proceeds to 
program step 424 where it obtains any collocation(s) that have been formed by preceding 
words. The existence of such collocations would be determined by continuously 
comparing words and phrases to the collocation dictionary contents. If such a collocation 
or collocations exist, then the current word possibility is tested to see if it fits within the 
collocation context. At step 428, those collocations which no longer apply are discarded. 
The processor 54 then proceeds to step 430 to determine if any word possibilities remain, 
and if so, the remaining word(s) is also tested within the collocation context beginning at 
program step 422. 

Once this process has been applied to all word possibilities, the processor 54 
identifies which word, or words, were found to "fit" within the collocation, before 
returning, via program step 436, to the main Linguistic Processor routine. Based on the 
results of the Collocation routine, certain of the remaining word possibilities can then be 
eliminated, or at least assigned a lower confidence level. 

Another optional function that can be used to resolve remaining word possibilities 
is the "Grammar Check" function, shown at block 356 in Figure 10. This function 
evaluates a word possibility by applying certain grammatical rules, and then determining 
whether the word complies with those rules. Words that do not grammatically fit can be 
eliminated as possibilities, or assigned lower confidence levels. 

By way of example, the Grammar Check function can be implemented with the 
program steps that are shown in Figure 10C. Thus, at step 440, a current word possibility 
along with a preceding word and a following word are received. Then at step 442, a set 
of grammar rules, stored in a portion of host sound processor memory, are queried to 
determine what "part of speech" would best fit in the grammatical context of the 
preceding word and the following word. If the current word possibility matches this "part 
of speech" at step 444, then that word is assigned a higher confidence level before 
returning to the Linguistic Processor at step 446. If the current word does not comply 
with the grammatical "best fit" at step 444, then it is assigned a low confidence level and 
returned to the main routine at step 446. Again, this confidence level can then be used to 
further eliminate remaining word possibilities. 

Referring again to Figure 10, having completed the various functions which 
identify the word content of the incoming speech signal, the Linguistic Processor function 
causes the host sound processor 54 to determine the number of word possibilities that still 
exist for any given series of Higgins structures. 

If no word possibilities have yet been identified, then the processor 54 will 
determine, at program step 366, if there remains a phonetic dictionary database (i.e., a 
specialized dictionary, a user defined dictionary, etc.) that has not yet been searched. If 
so, the processor 54 will obtain the new dictionary at step 368, and then re-execute the 
searching algorithm beginning at program step 352. If however no dictionaries remain, 
then the corresponding unidentified series of phoneme sounds (the unidentified "word") 
will be sent directly to the Command Processor portion of the program method, which 
resides on Host computer 22. 

If at program step 358 more than one word possibility still remains, the remaining 
words are all sent to the Command Processor. Similarly, if only one word possibility 
remains, that word is sent directly to the Command Processor portion of the 
algorithm. Having output the word, or possible words, program step 370 causes the host 
sound processor 54 to return to the main algorithm, shown on Figure 4. 
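
The dictionary fallback of steps 366 and 368 can be sketched as a simple loop over the available dictionaries. The `lookup` callable stands in for the Dictionary Lookup function together with any optional filtering; it, the return layout and the name `identify_words` are assumptions of this sketch.

```python
def identify_words(phonemes, dictionaries, lookup):
    """Search each dictionary in turn until one yields word possibilities.

    phonemes:     sequence of phoneme sounds for the unresolved word.
    dictionaries: dictionaries in search order (common words first).
    lookup:       callable(phonemes, dictionary) -> list of words
                  (hypothetical stand-in for Dictionary Lookup).
    """
    for dictionary in dictionaries:
        words = lookup(phonemes, dictionary)
        if words:
            # One or more possibilities: all are sent to the Command Processor.
            return {"words": words}
    # No dictionary matched: pass the raw phoneme series on unchanged.
    return {"unidentified": list(phonemes)}
```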

As words are extracted from the incoming speech signal by the Linguistic 
Processor, they are immediately passed to the next function in the overall program method 
referred to as the "Command Processor," shown at function block 114 in Figure 4. In the 
preferred embodiment, the Command Processor is a series of program steps that are 
executed by a Host Computer 22, such as a standard desktop personal computer. As 
already noted, the host computer 22 receives the incoming words by way of a suitable 
communications medium, such as a standard RS-232 cable 24 and interface 66. The 
Command Processor then receives each word, and determines the manner by which it 
should be used on the host computer 22. For example, a spoken word may be input as 
text directly into an application, such as a wordprocessor document. Conversely, the 
spoken word may be passed as a command to the operating system or application. 

Referring next to Figure 11, illustrated is one preferred example of the program 
steps used to implement the Command Processor function. To begin, program step 450 
causes the host computer 22 to receive a word created by the Linguistic Processor portion 
of the algorithm. The host computer 22 then determines, at step 452, whether the word 
received is an operating system command. This is done by comparing the word to the 
contents of a definition file database, which defines all words that constitute operating 
system commands. If such a command word is received, it is passed directly to the host 
computer 22 operating system, as is shown at program step 454. 

If the incoming word does not constitute an operating system command, step 456 
is executed, where it is determined if the word is instead an application command, as for 
instance, a command to a wordprocessor or spreadsheet. Again, this determination is 
made by comparing the word to another definition file database, which defines all words 
that constitute an application command. If the word is an application command word, 
then it is passed directly, at step 458, to the intended application. 

If the incoming word is neither an operating system command nor an application 
command, then program step 460 is executed, where it is determined whether the 
Command Processor is still in a "command mode." If so, the word is discarded at step 
464, and essentially ignored. However, if the Command Processor is not in a command 
mode, then the word will be sent directly to the current application as text. 

Once a word is passed as a command to either the operating system or application 
at program steps 454 and 458, the host computer 22 proceeds to program step 466 to 
determine whether the particular command sequence is yet complete. If not, the algorithm 
remains in a "command mode," and continues to monitor incoming words so as to pass 
them as commands directly to the respective operating system or application. If the 
command sequence is complete at step 466, then the algorithm will exit the command 
mode at program step 470. 
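
The routing decisions of steps 452 through 470 can be sketched as a small state machine. The text does not say how a command sequence is judged complete, so this sketch takes a caller-supplied predicate `is_complete`; that predicate, the class name and the returned labels are all hypothetical.

```python
class CommandProcessor:
    """Route each recognized word to the OS, the application, or text input."""

    def __init__(self, os_commands, app_commands, is_complete):
        self.os_commands = set(os_commands)    # definition file, step 452
        self.app_commands = set(app_commands)  # definition file, step 456
        self.is_complete = is_complete         # assumed completion test
        self.command_mode = False

    def route(self, word):
        if word in self.os_commands:
            target = "os"                      # step 454
        elif word in self.app_commands:
            target = "application"             # step 458
        elif self.command_mode:
            return "discard"                   # step 464
        else:
            return "text"                      # sent to the application as text
        # Steps 466 and 470: stay in command mode until the sequence ends.
        self.command_mode = not self.is_complete(word)
        return target
```

Keeping the mode flag on the object means that non-command words arriving in the middle of a command sequence are ignored rather than typed into the current document.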

In this way, the Command Processor acts as a front-end to the operating system 
and/or to the applications that are executing on the host computer 22. As each new word 
is received, it is selectively directed to the appropriate computer resource. Operating in 
this manner, the system and method of the current invention act as a means for entering 
data and/or commands to a standard personal computer. As such, the system essentially 
replaces, or supplements other computer input devices, such as keyboards and pointing 
devices. 



III. SUMMARY AND SCOPE OF THE INVENTION 

In summary, the system and method of the present invention for speech 
recognition provides a powerful and much needed tool for providing user independent 
speech recognition. Importantly, the system and method extracts only the essential 
components of an incoming speech signal. The system then isolates those components in 
a manner such that the underlying sound characteristics that are common to all speakers 
can be identified, and thereby used to accurately identify the phonetic make-up of the 
speech signal. This permits the system and method to recognize speech utterances from 
any speaker of a given language, without requiring the user to first "train" the system with 
specific voice characteristics. 

Further, the system and method implements this user independent speech 
recognition in a manner such that it occurs in substantially "real time." As such, the user 
can speak at normal conversational speeds, and is not required to pause between each 
word. 

Finally, the system utilizes various linguistic processing techniques to translate the 
identified phonetic sounds into a corresponding word or phrase, of any given language. 
Once the phonetic stream is identified, the system is capable of recognizing a large 
vocabulary of words and phrases. 

While the system and method of the present invention has been described in the 
context of the presently preferred embodiment and the examples illustrated and described 
herein, the invention may be embodied in other specific ways or in other specific forms 
without departing from its spirit or essential characteristics. Therefore, the described 
embodiments and examples are to be considered in all respects only as illustrative and not 
restrictive. The scope of the invention is, therefore, indicated by the appended claims 
rather than by the foregoing description, and changes which come within the meaning 
and range of equivalency of the claims are to be embraced within their scope. 

1. A sound recognition system for identifying the phoneme sound types that 
are contained within an audio speech signal, the sound recognition system comprising: 

audio processor means for receiving an audio speech signal and for 
converting the audio speech signal into a representative audio electrical signal; 

analog-to-digital converter means for digitizing the audio electrical signal 
at a predetermined sampling rate so as to produce a digitized audio signal; and 

sound recognition means for performing a time domain analysis on the 
digitized audio signal so as to identify at least one time domain sound 
characteristic of said audio speech signal, and for successively filtering the 
digitized audio signal, using a plurality of filter bands having predetermined cutoff 
frequencies, into a corresponding plurality of separate time domain filtered signals 
so as to measure at least one frequency domain sound characteristic of each of 
said filtered signals, and based on said at least one time domain characteristic and 
at least one frequency domain characteristic, for identifying at least one phoneme 
sound type contained within the audio speech signal. 

2. A sound recognition system as defined in claim 1 wherein the audio 
processor means comprises: 

means for inputting the audio speech signal and for converting it to an 
audio electrical signal; and 

means for conditioning the audio electrical signal so that it is in a 
representative electrical form that is suitable for digital sampling. 

3. A sound recognition system as defined in claim 1 wherein the conditioning 
means comprises: 

signal amplification means for amplifying the audio electrical signal to a 
predetermined level; 

means for limiting the level of the amplified audio electrical signal to a 
predetermined output level; and 

filter means, connected to the limiting means, for limiting the audio 
electrical signal to a predetermined maximum frequency of interest and thereby 
providing the representative audio electrical signal. 

4. A sound recognition system as defined in claim 1, further comprising 
electronic means for receiving at least one word in a preselected language corresponding 
to the at least one phoneme sound type contained within the audio speech signal, and for 
programmably processing the at least one word as either a data input or as a command 
input. 

5. A sound recognition system as defined in claim 1, wherein the said at least 
one time domain characteristic includes at least one of the following: an average amplitude 
of the audio speech signal; an absolute difference average of the audio speech signal; and 
a zero crossing rate of the audio speech signal. 

6. A sound recognition system as defined in claim 1, wherein the said at least 
one frequency domain characteristic includes at least one of the following: a frequency of 
at least one of said filtered signals; and an amplitude of at least one of said filtered signals. 

7. A sound recognition system for identifying the phoneme sound types that 
are contained within an audio speech signal, the sound recognition system comprising: 

audio processor means for receiving an audio speech signal and for 
converting the audio speech signal into a representative audio electrical signal; 

analog-to-digital converter means for digitizing the audio electrical signal 
at a predetermined sampling rate so as to produce a digitized audio signal; 

sound recognition means for programmably carrying out the following 
program steps: 

(a) performing a time domain analysis on the digitized audio 
signal so as to identify at least one time domain sound characteristic of 
said audio speech signal; 

(b) using a plurality of filter bands having predetermined cutoff 
frequencies, successively filtering the digitized audio signal into a 
corresponding plurality of separate time domain filtered signals; 

(c) measuring at least one frequency domain sound 
characteristic of each of said filtered signals; and 

(d) based on the at least one time domain characteristic and the 
at least one frequency domain characteristic, identifying at least one 
phoneme sound type contained within the audio speech signal. 

8. A sound recognition system as defined in claim 7 wherein the audio 
processor means comprises: 

means for inputting the audio speech signal and for converting it to an 
audio electrical signal; and 

means for conditioning the audio electrical signal so that it is in a 
representative electrical form that is suitable for digital sampling. 

9. A sound recognition system as defined in claim 8 wherein the conditioning 
means comprises: 

signal amplification means for amplifying the audio electrical signal to a 
predetermined level; 

means for limiting the level of the amplified audio electrical signal to a 
predetermined output level; and 

filter means, connected to the limiting means, for limiting the audio 
electrical signal to a predetermined maximum frequency of interest and thereby 
providing the representative audio electrical signal. 

10. A sound recognition system as defined in claim 9, wherein the said at least 
one time domain characteristic includes at least one of the following: an average amplitude 
of the audio speech signal; an absolute difference average of the audio speech signal; and 
a zero crossing rate of the audio speech signal. 

11. A sound recognition system as defined in claim 10, wherein the said at least 
one frequency domain characteristic includes at least one of the following: a frequency of 
at least one of said filtered signals; and an amplitude of at least one of said filtered signals. 

12. A sound recognition system as defined in claim 11, wherein the at least one 
phoneme sound type contained within the audio speech signal is identified by comparing 
the at least one measured frequency domain characteristic to a plurality of sound standards 
each having an associated phoneme sound type and at least one corresponding standard 
frequency domain characteristic, wherein the at least one identified sound type is the 
sound standard type having a standard frequency domain characteristic that matches the 
measured frequency domain characteristic most closely. 

13. A sound recognition system as defined in claim 12, wherein the at least one 
measured frequency domain characteristic, and the plurality of standard frequency domain 
characteristics are expressed in terms of a chromatic scale. 

14. A sound recognition system as defined in claim 13, further comprising 
electronic means for receiving at least one word in a preselected language corresponding 
to the at least one phoneme sound type contained within the audio speech signal, and for 
programmably processing the at least one word as either a data input or as a command 
input. 

15. A sound recognition system for identifying the phoneme sound types that 
are contained within an audio speech signal, the sound recognition system comprising: 

audio processor means for receiving an audio speech signal and for 
converting the audio speech signal into a representative audio electrical signal; 

analog-to-digital converter means for digitizing the audio electrical signal 
at a predetermined sampling rate so as to produce a digitized audio signal; 

digital sound processor means for performing a time domain analysis on 
the digitized audio signal so as to identify at least one time domain sound 
characteristic of said audio speech signal, and for successively filtering the 
digitized audio signal, using a plurality of filter bands having predetermined cutoff 
frequencies, into a corresponding plurality of separate time domain filtered signals 
so as to measure at least one frequency domain sound characteristic of each of the 
filtered signals; and 

host sound processor means for identifying at least one phoneme sound 
type contained within the audio speech signal based on the at least one time 
domain characteristic and the at least one frequency domain characteristic, and for 
translating said at least one phoneme sound type into at least one representative 
word of a preselected language. 

16. A sound recognition system as defined in claim 15 wherein the audio 
processor means comprises: 

means for inputting the audio speech signal and for converting it to an 
audio electrical signal; and 

means for conditioning the audio electrical signal so that it is in a 
representative electrical form that is suitable for digital sampling. 

17. A sound recognition system as defined in claim 16 wherein the 
conditioning means comprises: 

signal amplification means for amplifying the audio electrical signal to a 
predetermined level; 

means for limiting the level of the amplified audio electrical signal to a 
predetermined output level; and 

filter means, connected to the limiting means, for limiting the audio 
electrical signal to a predetermined maximum frequency of interest and thereby 
providing the representative audio electrical signal. 

18. A sound recognition system as defined in claim 15, further comprising 
electronic means for receiving at least one word in a preselected language corresponding 
to the at least one phoneme sound type contained within the audio speech signal and for 
programmably processing the at least one word as either a data input or as a command 
input. 



19. A sound recognition system as defined in claim 15, wherein the said at least 
one time domain characteristic includes at least one of the following: an average amplitude 
of the audio speech signal; an absolute difference average of the audio speech signal; and 
a zero crossing rate of the audio speech signal. 

20. A sound recognition system as defined in claim 15, wherein the said at least 
one frequency domain characteristic includes at least one of the following: a frequency of 
at least one of said filtered signals; and an amplitude of at least one of said filtered signals. 

21. A sound recognition system as defined in claim 15, wherein the digital 
sound processor means comprises: 

first programmable means for programmably executing a predetermined 
series of program steps; 

program memory means for storing the predetermined series of program 
steps utilized by said first programmable means; and 

data memory means for providing a digital storage area for use by said first 
programmable means. 

22. A sound recognition system as defined in claim 15, wherein the host sound 
processor means comprises: 

second programmable means for programmably executing a predetermined 
series of program steps; 

program memory means for storing the predetermined series of program 
steps utilized by said second programmable means; and 

data memory means for providing a digital storage area for use by said first 
programmable means. 

23. A sound recognition system for identifying the phoneme sound types that 
are contained within an audio speech signal, the sound recognition system comprising: 

audio processor means for receiving an audio speech signal and for 
converting the audio speech signal into a representative audio electrical signal; 

analog-to-digital converter means for digitizing the audio electrical signal 
at a predetermined sampling rate so as to produce a digitized audio signal; 

digital sound processor means for programmably carrying out the 
following program steps: 




(a) performing a time domain analysis on the digitized audio 
signal so as to identify at least one time domain sound characteristic of 
said audio speech signal; 

(b) using a plurality of filter bands having predetermined cutoff 
frequencies, successively filtering the digitized audio signal into a 
corresponding plurality of separate time domain filtered signals; 

(c) measuring at least one frequency domain sound 
characteristic from each of said filtered signals; 

host sound processor means for programmably carrying out the following 
program steps: 

(a) based on the at least one time domain characteristic and the 
at least one frequency domain characteristic, identifying at least one 
phoneme sound type contained within the audio speech signal; and 

(b) translating said at least one phoneme sound type into at 
least one representative word of a preselected language. 

24. A sound recognition system as defined in claim 23 wherein the audio 
processor means comprises: 

means for inputting the audio speech signal and for converting it to an 
audio electrical signal; and 

means for conditioning the audio electrical signal so that it is in a 
representative electrical form that is suitable for digital sampling. 

25. A sound recognition system as defined in claim 24 wherein the 
conditioning means comprises: 

signal amplification means for amplifying the audio electrical signal to a 
predetermined level; 

means for limiting the level of the amplified audio electrical signal to a 
predetermined output level; and 

filter means, connected to the limiting means, for limiting the audio 
electrical signal to a predetermined maximum frequency of interest and thereby 
providing the representative audio electrical signal. 

26. A sound recognition system as defined in claim 25, wherein the said at least 
one time domain characteristic includes at least one of the following: an average amplitude 
of the audio speech signal; an absolute difference average of the audio speech signal; and 
a zero crossing rate of the audio speech signal. 

27. A sound recognition system as defined in claim 26, wherein the said at least 
one frequency domain characteristic includes at least one of the following: a frequency of 
at least one of said filtered signals; and an amplitude of at least one of said filtered signals. 

28. A sound recognition system as defined in claim 27, wherein the at least one 
phoneme sound type contained within the audio speech signal is identified by comparing 
the at least one measured frequency domain characteristic to a plurality of sound standards 
each having an associated phoneme sound type and at least one corresponding standard 
frequency domain characteristic, wherein the at least one identified sound type is the 
sound standard type having a standard frequency domain characteristic that matches the 
measured frequency domain characteristic most closely. 

29. A sound recognition system as defined in claim 28, wherein the at least one 
measured frequency domain characteristic, and the plurality of standard frequency domain 
characteristics are expressed in terms of a chromatic scale. 

30. A sound recognition system as defined in claim 29, further comprising 
electronic means for receiving the at least one representative word, and for programmably 
processing the at least one word as either a data input or as a command input. 

31. A method for identifying the phoneme sound types that are contained 
within an audio speech signal, the method comprising the steps of: 

(a) receiving an audio speech signal; 

(b) converting the audio speech signal into a representative audio 
electrical signal; 

(c) digitizing the audio electrical signal at a predetermined sampling 
rate so as to produce a digitized audio signal; 

(d) performing a time domain analysis on the digitized audio signal so 
as to identify at least one time domain sound characteristic of said audio speech 
signal; 

(e) using a plurality of filter bands having predetermined cutoff 
frequencies, successively filtering the digitized audio signal into a corresponding 
plurality of separate time domain filtered signals; 

(f) measuring at least one frequency domain sound characteristic from 
each of said filtered signals; and 

(g) based on the at least one time domain characteristic and the at least 
one frequency domain characteristic, identifying at least one phoneme sound type 
contained within the audio speech signal. 

32. A sound recognition system as defined in claim 31, wherein the said at least 
one time domain characteristic includes at least one of the following: an average amplitude 
of the audio speech signal; an absolute difference average of the audio speech signal; and 
a zero crossing rate of the audio speech signal. 

33. A sound recognition system as defined in claim 31, wherein the said at least 
one frequency domain characteristic includes at least one of the following: a frequency of 
at least one of said filtered signals; and an amplitude of at least one of said filtered signals. 

34. A sound recognition system as defined in claim 31, wherein the at least one 
phoneme sound type contained within the audio speech signal is identified by comparing 
the at least one measured frequency domain characteristic to a plurality of sound standards 
each having an associated phoneme sound type and at least one corresponding standard 
frequency domain characteristic, wherein the at least one identified sound type is the 
sound standard type having a standard frequency domain characteristic that matches the 
measured frequency domain characteristic most closely. 

35. A sound recognition system as defined in claim 34, wherein the at least one 
measured frequency domain characteristic, and the plurality of standard frequency domain 
characteristics are expressed in terms of a chromatic scale. 
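
For illustration only, the following sketch shows how measured frequencies might be expressed on a chromatic scale and compared against a plurality of sound standards to select the closest match, in the manner recited in the claims above. The reference frequency, the distance measure (sum of absolute semitone differences), the standard labels, and the frequency values are all hypothetical assumptions and are not taken from this document.

```python
import math
from typing import Dict, Sequence

REFERENCE_HZ = 440.0  # assumed reference; the document does not specify one here


def to_semitones(freq_hz: float, reference_hz: float = REFERENCE_HZ) -> float:
    """Express a frequency as a number of chromatic steps from the reference.

    Twelve steps correspond to a doubling of frequency, so equal frequency
    ratios map to equal distances on this scale.
    """
    if freq_hz <= 0.0:
        raise ValueError("frequency must be positive")
    return 12.0 * math.log2(freq_hz / reference_hz)


def closest_standard(measured_hz: Sequence[float],
                     standards: Dict[str, Sequence[float]]) -> str:
    """Return the label of the standard whose band frequencies match most closely.

    Each standard lists one frequency per filter band, in the same band order
    as the measurement. Distance is an assumed measure: the sum of absolute
    differences in chromatic steps.
    """
    measured = [to_semitones(f) for f in measured_hz]

    def distance(label: str) -> float:
        reference = standards[label]
        if len(reference) != len(measured):
            raise ValueError("standard and measurement must cover the same bands")
        return sum(abs(m - to_semitones(r)) for m, r in zip(measured, reference))

    return min(standards, key=distance)


# Hypothetical standards with made-up band frequencies, for demonstration only.
EXAMPLE_STANDARDS = {
    "standard_1": [700.0, 1200.0],
    "standard_2": [300.0, 2300.0],
}
```

With these made-up values, a measurement of `[320.0, 2200.0]` lies within about two chromatic steps of `standard_2` and roughly two octaves (summed over both bands) from `standard_1`, so `closest_standard` selects `standard_2`.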